Abstract

We are interested in risk constraints for infinite horizon discrete time Markov decision processes 
(MDPs). Starting with average reward MDPs, we show that increasing concave stochastic dominance 
constraints on the empirical distribution of reward lead to linear constraints on occupation measures. 
The optimal policy for the resulting class of dominance-constrained MDPs is obtained by solving a 
linear program. We compute the dual of this linear program to obtain average dynamic programming 
optimality equations that reflect the dominance constraint. In particular, a new pricing term appears 
in the optimality equations corresponding to the dominance constraint. We show that many types of 
stochastic orders can be used in place of the increasing concave stochastic order. We also carry out a 
parallel development for discounted reward MDPs with stochastic dominance constraints. The paper
concludes with a portfolio optimization example.



1 Introduction 



Markov decision processes (MDPs) are a natural and powerful framework for stochastic control problems. 
In the present paper, we take up the issue of risk constraints in MDPs. Convex analytic methods for 
MDPs have been successful at handling many types of constraints. Our specific goal is to find and study 
risk constraints for MDPs that are amenable to convex analytic formulation. It turns out that stochastic 
dominance constraints are natural risk constraints for MDPs.

Convex analytic methods are well studied for Markov decision processes. The linear programming approach
for MDPs was pioneered in [3UJ, and an early survey is found in [3J. The main idea is that some MDPs
can be written as convex optimization problems in terms of appropriate occupation measures. [5J 1211 |SJ 125)
discuss a rigorous theory of convex optimization for MDPs with general Borel state and action spaces.
Detailed monographs on Markov decision processes are found in [26, 27, 34]. Constrained MDPs can naturally
be embedded in this framework. Constrained discounted MDPs are explored in [18, 19]. PQ is a substantial
monograph on constrained MDPs. Constrained discounted MDPs in Borel spaces are analyzed in [22], and
constrained average cost MDPs in Borel spaces are developed in [23]. Infinite dimensional linear programming
plays a fundamental role in both [22, 23], and the theory of infinite dimensional linear programming
is developed in [2]. The special case of constraints on expected utility in discounted MDPs is considered in
[29]. MDPs with expected constraints and pathwise constraints, also called hard constraints, are considered
in [32] using convex analytic methods. An inventory system is detailed there to motivate the theoretical results.

Policies in MDPs induce Markov chains. Typically, policies are evaluated with respect to some measure 
of expected reward, such as long-run average reward or discounted reward. The variation/spread/dispersion 
of policies is also critical to their evaluation. Given two policies with equal expected performance, we would 
prefer the one with smaller variation in some sense. Consider a discounted portfolio optimization problem, 
for example. The expected discounted reward of an investment policy is a key performance measure; the 
downside variation of an investment policy is also a key performance measure. When rewards and costs are 
involved, the variation of a policy can also be called its risk. 
Risk management for MDPs has been considered from many perspectives in the literature. [2U] includes
penalties for the variance of rewards in MDPs. The optimal policy is obtained by solving a nonlinear
programming problem in occupation measures. In [37], the mean-variance trade-off in MDPs is further
explored in a Pareto-optimality sense. The conditional value-at-risk of the total cost in a finite horizon
MDP is constrained in pQ. It is argued that convex analytic methods do not apply to this problem type and
an offline iterative algorithm is employed to solve for the optimal policy. [35] develops Markov risk measures
for finite horizon and infinite horizon discounted MDPs. Dynamic programming equations are derived that
reflect the risk aversion, and policy iteration is shown to solve the infinite horizon problem.

Our notion of risk-constrained MDPs differs from the literature surveyed above. We are interested in the empirical
distribution of reward, rather than in its expectation, variance, or other summary statistics. Our approach is
based on stochastic orders, which are partial orders on the space of random variables; see [33, 36] for extensive
monographs on stochastic orders. [3J [TU] use the increasing concave stochastic order to define stochastic
dominance constraints in single stage stochastic optimization. The increasing concave stochastic order is
notable for its connection to risk-averse decision makers, i.e. it captures the preferences of all risk-averse
decision makers. A benchmark random variable is introduced, and a concave random variable-valued mapping
is constrained to dominate the benchmark in the increasing concave stochastic order. It is shown that
increasing concave functions are the Lagrange multipliers of the dominance constraints. The dual problem
is a search over a certain class of increasing concave functions, interpreted as utility functions, and strong
duality is established. Stochastic dominance constraints are applied to finite horizon stochastic programming
problems with linear system dynamics in [12]. Specifically, a stochastic dominance constraint is placed on
a vector of state and action dependent reward functions across the finite planning horizon. The Lagrange 
multipliers of this dynamic stochastic dominance constraint are again determined to be increasing concave 
functions, and strong duality holds. In contrast, we place a stochastic dominance constraint on the empirical 
distribution of reward in infinite horizon MDPs. We argue that this type of constraint comprehensively 
accounts for the variation in policies in MDPs. 

We make two main contributions in this paper. First, we show how to formulate stochastic dominance 
constraints for long-run average reward maximizing MDPs. More immediately, we show that stochastic 
dominance constrained MDPs can be solved via linear programming over occupation measures. Our model 
is more general than [12] because it allows for an arbitrary transition kernel and is also infinite horizon. Also,
our model is more computationally tractable than the stochastic programming model in [12] because it leads
to linear programs. Second, we apply infinite-dimensional linear programming duality to gain more insight: 
the resulting duals are similar to the linear programming form of the average reward dynamic programming 
optimality equations. However, new decision variables corresponding to the stochastic dominance constraint 
appear in an intuitive way. Specifically, the new decision variables are increasing concave functions that 
price rewards. This observation parallels the results in [10, 13] and is natural because our stochastic
dominance constraints are defined in terms of increasing concave functions. The upcoming dual problems
are themselves linear programs, unlike the dual problems in [9, 10, 13], which are general infinite-dimensional
convex optimization problems.

This paper is organized as follows. In section 2, we consider stochastic dominance constraints for long- 
run average reward maximizing MDPs. In section 3 we formulate this problem as a static optimization 
problem, in fact a linear programming problem, in a space of occupation measures. Section 4 develops the 
dual for this problem using infinite dimensional linear programming duality, and reveals the form of the 
Lagrange multipliers. In section 5, we discuss a number of immediate variations and extensions, especially 
the drastically simpler development on finite state and action spaces. We illustrate our method in section 6 
with a portfolio optimization example, and then conclude the paper in section 7. 

2 MDPs and stochastic dominance 

The first subsection presents a general model for average reward MDPs, and the second explains how to 
apply stochastic dominance constraints. 
2.1 Average reward MDPs

A typical representation of a discrete time MDP is the 5-tuple

\[ (S, A, \{A(s) : s \in S\}, Q, r). \]

The state space $S$ and the action space $A$ are Borel spaces, subsets of complete and separable metric spaces,
with corresponding Borel $\sigma$-algebras $\mathcal{B}(S)$ and $\mathcal{B}(A)$. We define $\mathcal{P}(S)$ to be the space of probability
measures over $S$ with respect to $\mathcal{B}(S)$, and we define $\mathcal{P}(A)$ analogously. For each state $s \in S$, the set
$A(s) \subset A$ is a measurable set in $\mathcal{B}(A)$ and indicates the set of feasible actions available in state $s$. The set
of feasible state-action pairs is written

\[ K = \{(s, a) \in S \times A : a \in A(s)\}, \]

and $K$ is assumed to be closed in $S \times A$. The transition law $Q$ governs the system evolution. Explicitly,
$Q(B \mid s, a)$ for $B \in \mathcal{B}(S)$ is the probability of visiting the set $B$ given the state-action pair $(s, a)$. Finally,
$r : K \to \mathbb{R}$ is a measurable reward function that depends on state-action pairs.

We now describe two classes of policies for MDPs. Let $H_t$ be the set of histories at time $t$, with $H_0 = S$,
$H_1 = K \times S$, and $H_t = K^t \times S$ for all $t \geq 2$. A specific history $h_t \in H_t$ records the state-action pairs visited
at times $0, 1, \ldots, t-1$ and the current state $s_t$. Define $\Pi$ to be the set of all history-dependent randomized
policies: collections of mappings $\pi_t : H_t \to \mathcal{P}(A)$ for all $t \geq 0$. Given a history $h_t \in H_t$ and a set $B \in \mathcal{B}(A)$,
$\pi_t(B \mid h_t)$ is the probability of selecting an action in $B$. Define $\Phi$ to be the class of stationary randomized
Markov policies: mappings $\phi : S \to \mathcal{P}(A)$ which only depend on history through the current state. For a
given state $s \in S$ and a set $B \in \mathcal{B}(A)$, $\phi(B \mid s)$ is the probability of choosing an action in $B$. The class $\Phi$
will be viewed as a subset of $\Pi$. We explicitly assume that both $\Pi$ and $\Phi$ only include feasible policies that
respect the constraints $K$.

The state and action at time $t$ are denoted $s_t$ and $a_t$, respectively. Any policy $\pi \in \Pi$ and initial
distribution $\nu \in \mathcal{P}(S)$ determines a probability measure $P^{\pi}_{\nu}$ and stochastic process $\{(s_t, a_t),\ t \geq 0\}$ defined
on a measurable space $(\Omega, \mathcal{F})$. The expectation operator with respect to $P^{\pi}_{\nu}$ is denoted $E^{\pi}_{\nu}[\cdot]$. Consider the
long-run expected average reward

\[ R(\pi, \nu) = \liminf_{T \to \infty} \frac{1}{T} E^{\pi}_{\nu}\left[ \sum_{t=0}^{T-1} r(s_t, a_t) \right]. \]

The classic long-run expected average reward maximization problem is

\[ \sup\ R(\pi, \nu) \tag{2.1} \]
\[ \text{s.t.}\ \pi \in \Pi. \tag{2.2} \]

It is known that a stationary policy in $\Phi$ is optimal for problem (2.1) - (2.2) under suitable conditions (this
result is found in [34] for finite and countable state spaces, and in [26, 27] for general Borel state and action
spaces).



2.2 Stochastic dominance 

Now we will motivate and formalize stochastic dominance constraints for problem (2.1) - (2.2). To begin,
let $z : K \to \mathbb{R}$ be another measurable reward function, possibly different from $r$. A risk-averse decision
maker with an increasing concave utility function $u : \mathbb{R} \to \mathbb{R}$ would be interested in maximizing his long-run
average expected utility

\[ \liminf_{T \to \infty} \frac{1}{T} E^{\pi}_{\nu}\left[ \sum_{t=0}^{T-1} u(z(s_t, a_t)) \right]. \]

However, it is difficult to choose one utility function to represent a risk-averse decision maker without 
considerable information. We will use the increasing concave order to express a continuum of risk preferences 
in MDPs. 
Definition 2.1. For random variables $X, Y \in \mathbb{R}$, $X$ dominates $Y$ in the increasing concave stochastic order,
written $X \geq_{icv} Y$, if $E[u(X)] \geq E[u(Y)]$ for all increasing concave functions $u : \mathbb{R} \to \mathbb{R}$ such that both
expectations exist.

Let $C(\mathbb{R})$ be the set of all continuous functions $f : \mathbb{R} \to \mathbb{R}$. Let $\mathcal{U}(\mathbb{R}) \subset C(\mathbb{R})$ be the set of all increasing
concave functions $u : \mathbb{R} \to \mathbb{R}$ such that

\[ \lim_{x \to \infty} u(x) = 0 \]

and

\[ u(x) = u(x_0) + \kappa (x - x_0) \]

for all $x \leq x_0$ for some $\kappa > 0$ and $x_0 \in \mathbb{R}$ (the choices of $\kappa$ and $x_0$ differ among $u$). The second condition
just means that all $u \in \mathcal{U}(\mathbb{R})$ become linear as $x \to -\infty$. By construction, functions $u \in \mathcal{U}(\mathbb{R})$ are bounded
from above by zero. We will use the set $\mathcal{U}(\mathbb{R})$ to characterize $X \geq_{icv} Y$.

Now define $(x)_- = \min\{x, 0\}$. We note that any function in $\mathcal{U}(\mathbb{R})$ can be written in terms of the family
$\{(x - \eta)_- : \eta \in \mathbb{R}\}$. To understand this result, choose $u \in \mathcal{U}(\mathbb{R})$ and a finite set of points $\{x_1, \ldots, x_j\}$. By
concavity, there exist $a_i \in \mathbb{R}$ such that $a_i (x - x_i) + u(x_i) \geq u(x)$ for all $x \in \mathbb{R}$ and for all $i = 1, \ldots, j$. Each
linear function $a_i (x - x_i) + u(x_i)$ is a global over-estimator of $u$. The piecewise linear increasing concave
function

\[ \min_{i = 1, \ldots, j} \{ a_i (x - x_i) + u(x_i) \} \]

is also a global over-estimator of $u$, and certainly

\[ u(x) \leq \min_{i = 1, \ldots, j} \{ a_i (x - x_i) + u(x_i) \} \leq a_i (x - x_i) + u(x_i) \]

for all $i = 1, \ldots, j$ and $x \in \mathbb{R}$. As the number of sample points $j$ increases, the polyhedral concave function
$\min_{i = 1, \ldots, j} \{ a_i (x - x_i) + u(x_i) \}$ becomes a better approximation of $u$. We realize that the function
$\min_{i = 1, \ldots, j} \{ a_i (x - x_i) + u(x_i) \}$ is equal to a finite sum of nonnegative scalar multiples of functions from
$\{(x - \eta)_- : \eta \in \mathbb{R}\}$. It follows that the relation $X \geq_{icv} Y$ is equivalent to $E[(X - \eta)_-] \geq E[(Y - \eta)_-]$ for
all $\eta \in \mathbb{R}$. When the support of $Y$ is contained in a compact interval $[a, b]$, the condition $E[(X - \eta)_-] \geq
E[(Y - \eta)_-]$ for all $\eta \in [a, b]$ is sufficient for $X \geq_{icv} Y$.
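
The shortfall characterization above can be checked numerically for random variables with finite support. The following is a minimal sketch, not part of the original development: the function names `shortfall` and `dominates_icv` are hypothetical, and the interval $[a, b]$ is assumed to be the smallest interval containing the support of $Y$. Because both shortfall curves are piecewise linear in $\eta$ with kinks at support points, it suffices to test the kinks inside $[a, b]$ and the two endpoints.

```python
def shortfall(values, probs, eta):
    """Expected shortfall E[(X - eta)_-], where (x)_- = min(x, 0)."""
    return sum(p * min(x - eta, 0.0) for x, p in zip(values, probs))


def dominates_icv(x_vals, x_probs, y_vals, y_probs, tol=1e-12):
    """Test E[(X - eta)_-] >= E[(Y - eta)_-] for every eta in [a, b].

    Assumption: [a, b] is the smallest interval containing supp(Y).
    """
    a, b = min(y_vals), max(y_vals)
    # The difference of the two shortfall curves is piecewise linear in eta,
    # so its minimum over [a, b] is attained at a kink or at an endpoint.
    candidates = list(x_vals) + list(y_vals)
    kinks = {a, b} | {v for v in candidates if a <= v <= b}
    return all(
        shortfall(x_vals, x_probs, eta) >= shortfall(y_vals, y_probs, eta) - tol
        for eta in sorted(kinks)
    )


x_vals, x_probs = [1.0], [1.0]            # X is the constant 1
y_vals, y_probs = [0.0, 2.0], [0.5, 0.5]  # Y is a fair coin flip between 0 and 2
print(dominates_icv(x_vals, x_probs, y_vals, y_probs))
print(dominates_icv(y_vals, y_probs, x_vals, x_probs))
```

In this example $X$ and $Y$ have the same mean but $X$ carries no risk, so the first check succeeds and the second fails, in line with the interpretation of the increasing concave order as the preference of all risk-averse decision makers.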

From now on, let $Y$ be a fixed reference random variable on $\mathbb{R}$ to benchmark the empirical distribution
of reward $z$. We assume that $Y$ has support in a compact interval $[a, b]$ throughout the rest of this paper.
Define

\[ Z_{\eta}(\pi, \nu) = \liminf_{T \to \infty} \frac{1}{T} E^{\pi}_{\nu}\left[ \sum_{t=0}^{T-1} (z(s_t, a_t) - \eta)_- \right] \]

to be the long-run expected average shortfall in $z$ at level $\eta$. We propose the class of stochastic
dominance-constrained MDPs:

\[ \sup\ R(\pi, \nu) \tag{2.3} \]
\[ \text{s.t.}\ Z_{\eta}(\pi, \nu) \geq E[(Y - \eta)_-], \quad \forall \eta \in [a, b], \tag{2.4} \]
\[ \pi \in \Pi. \tag{2.5} \]

For emphasis, we index $\eta$ over the compact set $[a, b]$ in (2.4). Allowing $\eta$ to range over all of $\mathbb{R}$ would lead to
major technical difficulties, as first observed in [51 IIP).

Constraint (2.4) is a continuum of constraints on the long-run expected average shortfall of the policy $\pi$ for
all $\eta \in [a, b]$. We will approach problem (2.3) - (2.5) by casting it in the space of long-run average occupation
measures. Then we will see that constraint (2.4) is equivalent to a stochastic dominance constraint on the
empirical distribution of rewards $z$, namely

\[ \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} z(s_t, a_t) \geq_{icv} Y. \]
To be clear, $\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} z(s_t, a_t)$ indicates a random variable on $\mathbb{R}$, not the long-run average of $z(s_t, a_t)$.
We can denote the feasible region of problem (2.3) - (2.5) succinctly as

\[ \Delta = \{ (\pi, \nu) \in \Pi \times \mathcal{P}(S) : R(\pi, \nu) > -\infty \text{ and } Z_{\eta}(\pi, \nu) \geq E[(Y - \eta)_-] \text{ for all } \eta \in [a, b] \}, \]

allowing problem (2.3) - (2.5) to be written as

\[ \rho^* = \sup \{ R(\pi, \nu) : (\pi, \nu) \in \Delta \}, \]

where $\rho^*$ is the optimal value.

Remark 2.2. We focus on the average reward case in this paper. The extension to the average cost case is
immediate. Let $c : S \times A \to \mathbb{R}$ be a measurable cost function. The long-run expected average cost is

\[ C(\pi, \nu) = \limsup_{T \to \infty} \frac{1}{T} E^{\pi}_{\nu}\left[ \sum_{t=0}^{T-1} c(s_t, a_t) \right]. \]

Similarly, let $\mathfrak{z} : S \times A \to \mathbb{R}$ be another measurable cost function that possibly differs from $c$. Since $\mathfrak{z}$
represents costs, we want the empirical distribution of $\mathfrak{z}$ to be "small" in a stochastic sense. For costs, it is
logical to use the increasing convex order rather than the increasing concave order. For random variables
$X, Y \in \mathbb{R}$, $X$ dominates $Y$ in the increasing convex stochastic order, written $X \geq_{icx} Y$, if $E[f(X)] \geq
E[f(Y)]$ for all increasing convex functions $f : \mathbb{R} \to \mathbb{R}$ such that both expectations exist. Define $(x)_+ =
\max\{x, 0\}$, and recall that the relation $X \geq_{icx} Y$ is equivalent to $E[(X - \eta)_+] \geq E[(Y - \eta)_+]$ for all
$\eta \in \mathbb{R}$. When the support of $Y$ is contained in an interval $[a, b]$, the relation $X \geq_{icx} Y$ is equivalent to
$E[(X - \eta)_+] \geq E[(Y - \eta)_+]$ for all $\eta \in [a, b]$.

Momentarily, let $Y$ be a benchmark random variable that we require to dominate the empirical distribution
of $\mathfrak{z}$. Define

\[ \mathfrak{Z}_{\eta}(\pi, \nu) = \limsup_{T \to \infty} \frac{1}{T} E^{\pi}_{\nu}\left[ \sum_{t=0}^{T-1} (\mathfrak{z}(s_t, a_t) - \eta)_+ \right] \]

for all $\eta \in [a, b]$. We obtain the cost minimization problem

\[ \inf\ C(\pi, \nu) \]
\[ \text{s.t.}\ \mathfrak{Z}_{\eta}(\pi, \nu) \leq E[(Y - \eta)_+], \quad \forall \eta \in [a, b], \]
\[ \pi \in \Pi. \]

The upcoming results of this paper all have immediate analogs for the average cost case.

3 A linear programming formulation 

This section develops problem (2.3) - (2.5) as an infinite dimensional linear program. First, we discuss
occupation measures on the set $K$. Occupation measures on $K$ can be interpreted as the long-run average
expected number of visits of a stochastic process $\{(s_t, a_t),\ t \geq 0\}$ to each state-action pair. Next, we argue
that a stationary policy in $\Phi$ is optimal for problem (2.3) - (2.5). It will follow that the functions $R(\phi, \nu)$
and $Z_{\eta}(\phi, \nu)$ can be written as linear functions of the occupation measure corresponding to $\phi$ and $\nu$. These
linear functions give us the desired linear program.

To proceed, we recall several well known results in convex analytic methods for MDPs. We will use
$\mu$ to denote probability measures on $K$, and the set of all probability measures on $K$ is denoted $\mathcal{P}(K)$.
Probability measures on $K$ can be equivalently viewed as probability measures on all of $S \times A$ with all mass
concentrated on $K$, $\mu(K) = 1$. For any $\mu \in \mathcal{P}(K)$, the marginal of $\mu$ on $S$ is the probability measure
$\hat{\mu} \in \mathcal{P}(S)$ defined by $\hat{\mu}(B) = \mu(B \times A)$ for all $B \in \mathcal{B}(S)$.

The following two well known facts are ubiquitous in the literature on convex analytic methods for MDPs
(see [15] for example). First, if $\mu$ is a probability measure on $K$, then there exists a stationary randomized
Markov policy $\phi \in \Phi$ such that $\mu$ can be disintegrated as $\mu = \hat{\mu} \cdot \phi$ where $\hat{\mu}$ is the marginal of $\mu$. Specifically,
$\mu = \hat{\mu} \cdot \phi$ is defined by

\[ \mu(B \times C) = \int_B \phi(C \mid s)\, \hat{\mu}(ds) \]

for all $B \in \mathcal{B}(S)$ and $C \in \mathcal{B}(A)$. Second, for each $\phi \in \Phi$ and $\nu \in \mathcal{P}(S)$, the probability measure $\mu = \nu \cdot \phi$
on $S \times A$ satisfies $\mu(K) = 1$ and $\hat{\mu} = \nu$. Specifically, $\mu = \nu \cdot \phi$ is defined by

\[ \mu(B \times C) = \int_B \phi(C \mid s)\, \nu(ds) \]

for all $B \in \mathcal{B}(S)$ and $C \in \mathcal{B}(A)$.

We can integrate measurable functions $f$ on $K$ with respect to measures $\mu \in \mathcal{P}(K)$. Define

\[ \langle \mu, f \rangle = \int_K f(s, a)\, \mu(d(s, a)) \]

as the integral of $f$ over state-action pairs $(s, a) \in K$ with respect to $\mu$. Then

\[ \langle \mu, r \rangle = \int_K r(s, a)\, \mu(d(s, a)) \]

is the expected reward with respect to the probability measure $\mu$ and

\[ \langle \mu, (z - \eta)_- \rangle = \int_K (z(s, a) - \eta)_-\, \mu(d(s, a)) \]

is the expected shortfall in $z$ at level $\eta$ with respect to the probability measure $\mu$.

We need to restrict to a certain class of probability measures. For notational convenience, define $r(s, \phi) =
\int_A r(s, a)\, \phi(da \mid s)$ and $Q(\cdot \mid s, \phi) = \int_A Q(\cdot \mid s, a)\, \phi(da \mid s)$.

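Definition 3.1. [23, Definition 3.4] A probability measure $\mu = \hat{\mu} \cdot \phi$ is called stable if

\[ \langle \mu, r \rangle = \int_K r(s, a)\, \mu(d(s, a)) > -\infty \]

and the marginal $\hat{\mu}$ is invariant with respect to $Q(\cdot \mid \cdot, \phi)$, i.e. $\hat{\mu}(B) = \int_S Q(B \mid s, \phi)\, \hat{\mu}(ds)$ for all $B \in \mathcal{B}(S)$.
When $\mu$ is stable, the long-run expected average reward $R(\phi, \hat{\mu})$ is

\[ R(\phi, \hat{\mu}) = \liminf_{T \to \infty} \frac{1}{T} E^{\phi}_{\hat{\mu}}\left[ \sum_{t=0}^{T-1} r(s_t, a_t) \right] \]

by the individual ergodic theorem [38, Page 388, Theorem 6]. Then for stable $\mu = \hat{\mu} \cdot \phi \in \mathcal{P}(K)$, it follows
that

\[ R(\phi, \hat{\mu}) = \langle \mu, r \rangle = \int_S r(s, \phi)\, \hat{\mu}(ds). \]

Similarly, for stable $\mu = \hat{\mu} \cdot \phi$, it is true that

\[ Z_{\eta}(\phi, \hat{\mu}) = \langle \mu, (z - \eta)_- \rangle = \int_S \left[ \int_A (z(s, a) - \eta)_-\, \phi(da \mid s) \right] \hat{\mu}(ds) \]

for all $\eta \in [a, b]$.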
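
On finite state and action spaces, the quantities above reduce to finite sums. The following is a minimal sketch under stated assumptions: the two-state, two-action MDP data (`Q`, `phi`, `r`, `z`) are made up for illustration, all function names are hypothetical, and the invariant marginal is approximated by power iteration, which presumes the chain induced by the stationary policy is ergodic.

```python
def stationary_distribution(P, iters=1000):
    """Approximate the invariant distribution of a transition matrix P by
    power iteration (assumes the induced chain is ergodic)."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(iters):
        dist = [sum(dist[s] * P[s][t] for s in range(n)) for t in range(n)]
    return dist


def occupation_measure(Q, phi):
    """Return mu(s, a) = mu_hat(s) * phi(a | s), with mu_hat invariant
    for the chain obtained by averaging Q over the policy phi."""
    n_s, n_a = len(phi), len(phi[0])
    P = [[sum(phi[s][a] * Q[s][a][t] for a in range(n_a)) for t in range(n_s)]
         for s in range(n_s)]
    marginal = stationary_distribution(P)
    return [[marginal[s] * phi[s][a] for a in range(n_a)] for s in range(n_s)]


def pairing(mu, f):
    """Finite version of <mu, f>: sum of f(s, a) * mu(s, a)."""
    return sum(m * v for mu_row, f_row in zip(mu, f) for m, v in zip(mu_row, f_row))


def shortfall_fn(z, eta):
    """The function (z(s, a) - eta)_- on state-action pairs."""
    return [[min(v - eta, 0.0) for v in row] for row in z]


# Hypothetical data: Q[s][a][t] is the probability of moving from s to t under a.
Q = [[[0.5, 0.5], [0.0, 1.0]],
     [[0.25, 0.75], [0.75, 0.25]]]
phi = [[1.0, 0.0], [0.5, 0.5]]       # phi[s][a] = probability of action a in state s
r = [[-1.0, -2.0], [-3.0, 0.0]]      # nonpositive rewards
z = r                                # benchmark the reward itself

mu = occupation_measure(Q, phi)
avg_reward = pairing(mu, r)                          # R(phi, mu_hat) = <mu, r>
avg_shortfall = pairing(mu, shortfall_fn(z, -2.0))   # Z_eta at eta = -2
print(mu, avg_reward, avg_shortfall)
```

Both the average reward and the average shortfall are linear in `mu`, which is the observation that makes a linear programming formulation possible.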

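To see the connection between problem (2.3) - (2.5) and stable policies, let $I_{\Gamma}$ be the indicator function
of a set $\Gamma$ in $\mathcal{B}(K)$. Define the occupation measure $\mu^{\pi,\nu}_T$ on $K$ via

\[ \mu^{\pi,\nu}_T(\Gamma) = \frac{1}{T} \sum_{t=0}^{T-1} E^{\pi}_{\nu}\{ I_{\Gamma}(s_t, a_t) \} = \frac{1}{T} \sum_{t=0}^{T-1} P^{\pi}_{\nu}\{ (s_t, a_t) \in \Gamma \} \]

for all $\Gamma \in \mathcal{B}(K)$. Then,

\[ R(\pi, \nu) = \liminf_{T \to \infty} \frac{1}{T} E^{\pi}_{\nu}\left[ \sum_{t=0}^{T-1} r(s_t, a_t) \right] = \liminf_{T \to \infty} \langle \mu^{\pi,\nu}_T, r \rangle \]

and

\[ Z_{\eta}(\pi, \nu) = \liminf_{T \to \infty} \frac{1}{T} E^{\pi}_{\nu}\left[ \sum_{t=0}^{T-1} (z(s_t, a_t) - \eta)_- \right] = \liminf_{T \to \infty} \langle \mu^{\pi,\nu}_T, (z - \eta)_- \rangle \]

for all $\eta \in [a, b]$.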

To continue, we introduce some technical assumptions for the rest of the paper. Let $C_b(K)$ be the space
of continuous and bounded functions on $K$, and define $C_b(S)$ analogously. The transition law $Q$ is defined to be
weakly continuous when $\int_S h(\xi)\, Q(d\xi \mid \cdot)$ is in $C_b(K)$ for all $h \in C_b(S)$.

Assumption 3.2. (a) Problem (2.3) - (2.5) is consistent, i.e. the set $\Delta$ is nonempty.

(b) The reward function $r$ is nonpositive, and for any $\epsilon \geq 0$ the set $\{(s, a) \in S \times A : r(s, a) \geq -\epsilon\}$ is
compact.

(c) The function $z(s, a)$ is bounded and upper semi-continuous on $S \times A$.

(d) The transition law $Q$ is weakly continuous.

A function $f$ on $K$ is called a moment if there exists a nondecreasing sequence of compact sets $K_n \uparrow K$
such that

\[ \lim_{n \to \infty} \inf_{(s, a) \notin K_n} f(s, a) = \infty, \]

see [26, Definition E.7]. When $K$ is compact, then any function on $K$ is a moment. Assumption 3.2(b)
implies that $-r$ is a moment. By construction, all of the functions $(z(s, a) - \eta)_-$ are bounded above by zero
on $S \times A$ for all $\eta \in [a, b]$.

The next lemma reduces the search for optimal policies to stable policies. We define

\[ \Delta_s = \{ \mu \in \mathcal{P}(K) : \mu \text{ is stable},\ \mu = \hat{\mu} \cdot \phi \text{ and } (\phi, \hat{\mu}) \in \Delta \} \]

to be the set of all stable probability measures $\mu$ that are feasible for problem (2.3) - (2.5).

Lemma 3.3. Suppose Assumption 3.2 holds. For each feasible pair $(\pi, \nu) \in \Delta$, there exists a stable probability
measure $\mu = \hat{\mu} \cdot \phi$ such that $(\phi, \hat{\mu}) \in \Delta$ and $R(\pi, \nu) \leq R(\phi, \hat{\mu}) = \langle \mu, r \rangle$.

Proof. For any $(\pi, \nu) \in \Delta$, there exists a stable policy $\mu = \hat{\mu} \cdot \phi$ such that

\[ R(\pi, \nu) \leq R(\phi, \hat{\mu}) = \langle \mu, r \rangle \]

by [26, Lemma 5.7.10]. By the same reasoning,

\[ E[(Y - \eta)_-] \leq Z_{\eta}(\pi, \nu) \leq Z_{\eta}(\phi, \hat{\mu}) = \langle \mu, (z - \eta)_- \rangle \]

for all $\eta \in [a, b]$ so that $\mu = \hat{\mu} \cdot \phi$ is feasible. □

Problem (2.3) - (2.5) is solvable if there exists a pair $(\pi^*, \nu^*) \in \Delta$ with $R(\pi^*, \nu^*) = \rho^*$, i.e. the optimal
value is attained. When an optimization problem is solvable, we can replace 'sup' and 'inf' with 'max' and
'min'. We use the preceding lemma to show that problem (2.3) - (2.5) is solvable.



Theorem 3.4. Problem (2.3) - (2.5) is solvable.

Proof. By Lemma 3.3,

\[ \rho^* = \sup \{ \langle \mu, r \rangle : \mu \in \Delta_s \}. \]

Now apply the proof of [26, Theorem 5.7.9]. Let $\{\epsilon_n\}$ be a sequence with $\epsilon_n \downarrow 0$ and $\epsilon_n < 1$. For any $\epsilon_n$,
there is a pair $(\pi^n, \nu^n) \in \Delta$ with $R(\pi^n, \nu^n) \geq \rho^* - \epsilon_n$ by the definition of $\rho^*$. Again, by Lemma 3.3, for each
$(\pi^n, \nu^n) \in \Delta$ there is a pair $(\phi^n, \hat{\mu}^n) \in \Delta$ such that $\mu^n = \hat{\mu}^n \cdot \phi^n$ is stable and $R(\pi^n, \nu^n) \leq R(\phi^n, \hat{\mu}^n) =
\langle \mu^n, r \rangle$.

By construction, $\langle \mu^n, r \rangle \geq \rho^* - \epsilon_n$ and $\epsilon_n \in (0, 1)$ for all $n$, so $\inf_n \langle \mu^n, r \rangle \geq \rho^* - 1$. It follows that
$\sup_n \langle \mu^n, -r \rangle \leq 1 - \rho^*$. Since $-r$ is a moment, the preceding inequality along with [26, Proposition E.8]
and [26, Proposition E.6] imply that there exists a subsequence of measures $\{\mu^{n_i}\}$ converging weakly to a
measure $\mu$ on $K$. Now

\[ \rho^* \leq \limsup_{i \to \infty} \langle \mu^{n_i}, r \rangle \]

holds since $\langle \mu^n, r \rangle \geq \rho^* - \epsilon_n$ for all $n$ and $\epsilon_n \downarrow 0$. By [26, Proposition E.2],

\[ \limsup_{i \to \infty} \langle \mu^{n_i}, r \rangle \leq \langle \mu, r \rangle, \]

so we obtain

\[ \rho^* \leq \langle \mu, r \rangle. \]

Since $\langle \mu, r \rangle \leq \rho^*$ must hold by definition of $\rho^*$, the preceding inequality shows that $\langle \mu, r \rangle = \rho^*$, i.e. $\mu$ attains
the optimal value $\rho^*$ and is stable. By a similar argument,

\[ E[(Y - \eta)_-] \leq \limsup_{i \to \infty} \langle \mu^{n_i}, (z - \eta)_- \rangle \leq \langle \mu, (z - \eta)_- \rangle \]

since each $\langle \mu^{n_i}, (z - \eta)_- \rangle \geq E[(Y - \eta)_-]$ for all $i$ and all $\eta \in [a, b]$. Thus, $\mu$ is feasible.

Let $\mu^*$ be the optimal stable measure just guaranteed, and disintegrate to obtain $\mu^* = \hat{\mu}^* \cdot \phi^*$. The pair
$(\phi^*, \hat{\mu}^*)$ is then optimal for problem (2.3) - (2.5) since

\[ R(\phi^*, \hat{\mu}^*) = \langle \mu^*, r \rangle = \rho^*, \]

and

\[ Z_{\eta}(\phi^*, \hat{\mu}^*) = \langle \mu^*, (z - \eta)_- \rangle \geq E[(Y - \eta)_-] \]

for all $\eta \in [a, b]$. □

From the preceding theorem, we can now write maximization instead of supremum in the objective of
problem (2.3) - (2.5),

\[ \rho^* = \max \{ R(\pi, \nu) : (\pi, \nu) \in \Delta_s \}. \]
We are now ready to formalize problem (2.3) - (2.5) as a linear program. Introduce the weight function

\[ w(s, a) = 1 - r(s, a) \]

on $K$. Under our assumption that $r$ is nonpositive, $w$ is bounded from below by one. The space of signed
Borel measures on $K$ is denoted $\mathcal{M}(K)$. With the preceding weight function, define $\mathcal{M}_w(K)$ to be the space
of signed measures $\mu$ on $K$ such that

\[ \|\mu\|_{\mathcal{M}_w(K)} = \int_K w(s, a)\, |\mu|(d(s, a)) < \infty. \]

We can identify stable probability measures with elements of $\mathcal{M}_w(K)$, and vice versa. First, if a probability
measure $\mu$ satisfies $\|\mu\|_{\mathcal{M}_w(K)} < \infty$, then certainly

\[ \langle \mu, r \rangle = \int_K r(s, a)\, \mu(d(s, a)) > -\infty \]

since $1 - r = w$. Conversely, if $\mu$ is a stable probability measure, then it is an element of $\mathcal{M}_w(K)$ since

\[ \int_K w(s, a)\, |\mu|(d(s, a)) = \int_K (1 - r(s, a))\, \mu(d(s, a)) = \mu(K) - \langle \mu, r \rangle < \infty. \]

Also define the weight function

\[ \hat{w}(s) = 1 - \sup_{a \in A(s)} r(s, a) \]

on $S$, which is also bounded from below by one. The space $\mathcal{M}_{\hat{w}}(S)$ is defined analogously with $\hat{w}$ and $S$ in
place of $w$ and $K$.

The topological dual of $\mathcal{M}_w(K)$ is $\mathcal{F}_w(K)$, the vector space of measurable functions $h : K \to \mathbb{R}$ such
that

\[ \|h\|_{\mathcal{F}_w(K)} = \sup_{(s, a) \in K} \frac{|h(s, a)|}{w(s, a)} < \infty. \]

Certainly, $r \in \mathcal{F}_w(K)$ by definition of $w$ since

\[ \|r\|_{\mathcal{F}_w(K)} = \sup_{(s, a) \in K} \frac{|r(s, a)|}{w(s, a)} = \sup_{(s, a) \in K} \frac{|r(s, a)|}{1 + |r(s, a)|} \leq 1. \]

Every element $h \in \mathcal{F}_w(K)$ induces a continuous linear functional on $\mathcal{M}_w(K)$ defined by

\[ \langle \mu, h \rangle = \int_K h(s, a)\, \mu(d(s, a)). \]

The two spaces $(\mathcal{M}_w(K), \mathcal{F}_w(K))$ are called a dual pair, and the duality pairing is the bilinear form
$\langle \mu, h \rangle : \mathcal{M}_w(K) \times \mathcal{F}_w(K) \to \mathbb{R}$ just defined. The topological dual of $\mathcal{M}_{\hat{w}}(S)$ is $\mathcal{F}_{\hat{w}}(S)$, which is defined
analogously with $S$ and $\hat{w}$ in place of $K$ and $w$.

We can now make some additional technical assumptions.

Assumption 3.5. (a) The function $(z - \eta)_-$ is an element of $\mathcal{F}_w(K)$ for all $\eta \in [a, b]$.

(b) The function $\int_S \hat{w}(\xi)\, Q(d\xi \mid s, a) : S \times A \to \mathbb{R}$ is an element of $\mathcal{F}_w(K)$.

Notice that Assumption 3.5(a) is satisfied if $z \in \mathcal{F}_w(K)$. To see this fact, reason that

\[ \|(z - \eta)_-\|_{\mathcal{F}_w(K)} \leq \|z - \eta\|_{\mathcal{F}_w(K)} \leq \|z\|_{\mathcal{F}_w(K)} + \|\eta\|_{\mathcal{F}_w(K)}, \]

where the first inequality follows from $|(z - \eta)_-| \leq |z - \eta|$. The constant function $f(x) = \eta$ on $K$ is in
$\mathcal{F}_w(K)$ since

\[ \|\eta\|_{\mathcal{F}_w(K)} = \sup_{(s, a) \in K} \frac{|\eta|}{w(s, a)} \leq |\eta|. \]

The linear mapping $L_0 : \mathcal{M}_w(K) \to \mathcal{M}_{\hat{w}}(S)$ defined by

\[ [L_0 \mu](B) = \hat{\mu}(B) - \int_K Q(B \mid s, a)\, \mu(d(s, a)), \quad \forall B \in \mathcal{B}(S), \tag{3.1} \]

is used to verify that $\mu$ is an invariant probability measure on $K$ with respect to $Q$. The mapping (3.1)
appears in all work on convex analytic methods for long-run average reward/cost MDPs. When $[L_0 \mu](B) = 0$,
it means that the long-run proportion of time spent in the set of states $B$ is equal to the rate at which the
system transitions to $B$ from all state-action pairs $(s, a) \in K$.
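
For a finite MDP, the mapping (3.1) is a flow-balance computation over states. The following is a minimal sketch with made-up data; the function name `apply_L0` and the arrays `Q`, `mu_invariant`, and `mu_other` are hypothetical and serve only to illustrate when $L_0 \mu = 0$ holds.

```python
def apply_L0(mu, Q):
    """Finite version of (3.1): for each state b, return the marginal mass
    mu_hat(b) minus the mass flowing into b under the transition law Q."""
    n_s, n_a = len(mu), len(mu[0])
    result = []
    for b in range(n_s):
        marginal = sum(mu[b])  # mu_hat(b) = sum over actions of mu(b, a)
        inflow = sum(Q[s][a][b] * mu[s][a] for s in range(n_s) for a in range(n_a))
        result.append(marginal - inflow)
    return result


# Hypothetical data: Q[s][a][t] is the probability of moving from s to t under a.
Q = [[[0.5, 0.5], [0.0, 1.0]],
     [[0.25, 0.75], [0.75, 0.25]]]

mu_invariant = [[0.5, 0.0], [0.25, 0.25]]  # satisfies flow balance for this Q
mu_other = [[1.0, 0.0], [0.0, 0.0]]        # all mass on (state 0, action 0)

print(apply_L0(mu_invariant, Q))
print(apply_L0(mu_other, Q))
```

The first measure is balanced at every state, so it passes the invariance test; the second places all mass on state 0 while half of that mass flows to state 1, so it does not.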

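Lemma 3.6. The condition $\mu \in \Delta_s$ is equivalent to $\langle \mu, r \rangle > -\infty$ and

\[ L_0 \mu = 0, \]
\[ \langle \mu, 1 \rangle = 1, \]
\[ \langle \mu, (z - \eta)_- \rangle \geq E[(Y - \eta)_-], \quad \forall \eta \in [a, b], \]
\[ \mu \geq 0. \]

Proof. The linear constraints $\langle \mu, 1 \rangle = \int_K \mu(d(s, a)) = 1$ and $\mu \geq 0$ just ensure that $\mu$ is a probability
measure on $K$. The condition $L_0 \mu = 0$ is equivalent to invariance of $\mu$ with respect to $Q$. For stable
$\mu = \hat{\mu} \cdot \phi$, $R(\phi, \hat{\mu}) = \langle \mu, r \rangle > -\infty$ and $Z_{\eta}(\phi, \hat{\mu}) = \langle \mu, (z - \eta)_- \rangle$. Since $Z_{\eta}(\phi, \hat{\mu}) \geq E[(Y - \eta)_-]$ for all
$\eta \in [a, b]$, the conclusion follows. □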



Next we continue with the representation of the dominance constraints (2.4). We would like to express
the constraints $\langle \mu, (z - \eta)_- \rangle \geq E[(Y - \eta)_-]$ for all $\eta \in [a, b]$ through a single linear operator.

Lemma 3.7. For any $\mu \in \mathcal{P}(K)$, $\langle \mu, (z - \eta)_- \rangle$ is uniformly continuous in $\eta$ on $[a, b]$.

Proof. Write $\langle \mu, (z - \eta)_- \rangle = \int_K (z(s, a) - \eta)_-\, \mu(d(s, a))$. Certainly, each function $(z(s, a) - \eta)_-$ is
continuous in $\eta$ for fixed $(s, a)$. Choose $\epsilon > 0$ and $|\eta' - \eta| \leq \epsilon$. Then

\[ |(z(s, a) - \eta')_- - (z(s, a) - \eta)_-| \leq |z(s, a) - \eta' - z(s, a) + \eta| = |\eta - \eta'| \leq \epsilon \]

by definition of $(x)_-$. It follows that

\[ \left| \int_K (z(s, a) - \eta')_-\, \mu(d(s, a)) - \int_K (z(s, a) - \eta)_-\, \mu(d(s, a)) \right| \leq \left| \int_K \epsilon\, \mu(d(s, a)) \right| = \epsilon, \]

since $\mu$ is a probability measure. □

The preceding lemma allows us to write the dominance constraints (2.4) as a linear operator in the space
of continuous functions. Recall that we have assumed $[a, b]$ to be a compact set. Let $C([a, b])$ be the space
of continuous functions on $[a, b]$ in the supremum norm,

\[ \|f\|_{C([a, b])} = \sup_{a \leq x \leq b} |f(x)| \]

for $f \in C([a, b])$. The topological dual of $C([a, b])$ is $\mathcal{M}([a, b])$, the space of finite signed Borel measures on
$[a, b]$. Every measure $\Lambda \in \mathcal{M}([a, b])$ induces a continuous linear functional on $C([a, b])$ through the bilinear
form

\[ \langle \Lambda, f \rangle = \int_a^b f(\eta)\, \Lambda(d\eta). \]

Define the linear operator $L_1 : \mathcal{M}_w(K) \to C([a, b])$ by

\[ [L_1 \mu](\eta) \triangleq \langle \mu, (z - \eta)_- \rangle, \quad \forall \eta \in [a, b]. \tag{3.2} \]

Also define the continuous function $y \in C([a, b])$ where $y(\eta) = E[(Y - \eta)_-]$ is the shortfall in $Y$ at level $\eta$
for all $\eta \in [a, b]$. The dominance constraints are then equivalent to $[L_1 \mu](\eta) \geq y(\eta)$ for all $\eta \in [a, b]$, which
can be written as the single inequality $L_1 \mu \geq y$ in $C([a, b])$.

The linear programming form of problem (2.3) - (2.5) is

\[ \max\ \langle \mu, r \rangle \tag{3.3} \]
\[ \text{s.t.}\ L_0 \mu = 0, \tag{3.4} \]
\[ \langle \mu, 1 \rangle = 1, \tag{3.5} \]
\[ L_1 \mu \geq y, \tag{3.6} \]
\[ \mu \in \mathcal{M}_w(K),\ \mu \geq 0. \tag{3.7} \]

Since $\rho^* = \max \{ R(\pi, \nu) : (\pi, \nu) \in \Delta_s \}$, and stable probability measures on $K$ can be identified as elements
of $\mathcal{M}_w(K)$, problem (2.3) - (2.5) is equivalent to problem (3.3) - (3.7).
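
On finite state and action spaces with a finitely supported benchmark $Y$, the linear program above has finitely many variables and constraints. The following is a minimal sketch of how the data of (3.3) - (3.7) might be assembled; the function names `build_lp` and `is_feasible` and the numerical data are hypothetical. It relies on the fact that, in this finite setting, $\eta \mapsto \langle \mu, (z - \eta)_- \rangle - y(\eta)$ is piecewise linear, so the continuum of constraints (3.6) only needs to be imposed at the kinks in $[a, b]$ and at the endpoints. No solver is called; the rows could be handed to any LP solver.

```python
def build_lp(Q, r, z, y_vals, y_probs):
    """Assemble objective and constraint rows of the finite LP.

    Variables are mu(s, a), ordered as in `pairs`.
    """
    n_s, n_a = len(r), len(r[0])
    pairs = [(s, a) for s in range(n_s) for a in range(n_a)]
    a_lo, b_hi = min(y_vals), max(y_vals)
    # Kinks of the piecewise linear shortfall curves inside [a, b].
    candidates = list(y_vals) + [z[s][a] for s, a in pairs]
    etas = sorted({a_lo, b_hi} | {v for v in candidates if a_lo <= v <= b_hi})

    objective = [r[s][a] for s, a in pairs]                      # (3.3)
    eq_rows = [[(1.0 if s == b else 0.0) - Q[s][a][b] for s, a in pairs]
               for b in range(n_s)]                              # (3.4)
    eq_rows.append([1.0] * len(pairs))                           # (3.5)
    eq_rhs = [0.0] * n_s + [1.0]
    dom_rows = [[min(z[s][a] - eta, 0.0) for s, a in pairs] for eta in etas]
    dom_rhs = [sum(p * min(v - eta, 0.0) for v, p in zip(y_vals, y_probs))
               for eta in etas]                                  # (3.6)
    return pairs, objective, eq_rows, eq_rhs, dom_rows, dom_rhs


def is_feasible(mu_vec, eq_rows, eq_rhs, dom_rows, dom_rhs, tol=1e-9):
    """Check (3.4) - (3.7) for a candidate vector mu_vec."""
    def dot(row):
        return sum(c * m for c, m in zip(row, mu_vec))
    return (all(m >= -tol for m in mu_vec)
            and all(abs(dot(row) - rhs) <= tol for row, rhs in zip(eq_rows, eq_rhs))
            and all(dot(row) >= rhs - tol for row, rhs in zip(dom_rows, dom_rhs)))


# Hypothetical data: Q[s][a][t] is the probability of moving from s to t under a.
Q = [[[0.5, 0.5], [0.0, 1.0]],
     [[0.25, 0.75], [0.75, 0.25]]]
r = [[-1.0, -2.0], [-3.0, 0.0]]
z = r
mu_vec = [0.5, 0.0, 0.25, 0.25]  # candidate occupation measure, ordered as `pairs`

lp = build_lp(Q, r, z, [-3.0, -1.0], [0.5, 0.5])
value = sum(c * m for c, m in zip(lp[1], mu_vec))
print(is_feasible(mu_vec, *lp[2:]), value)
```

With the benchmark taking values $-3$ and $-1$ with equal probability, the candidate measure satisfies every constraint; replacing the benchmark by the constant $0$ makes it infeasible, because the long-run reward distribution under this measure puts mass on strictly negative values.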
4 Establishing strong duality

In this section we apply infinite-dimensional linear programming duality to obtain the strong dual to problem
(3.3) - (3.7). The development in [5] is behind our duality development, and the duality theory for linear
programming for MDPs on Borel spaces in general.

We will introduce Lagrange multipliers for constraints (3.4), (3.5), and (3.6); each Lagrange multiplier is
drawn from the appropriate topological dual space. Introduce Lagrange multipliers $h \in \mathcal{F}_{\hat{w}}(S)$ for constraint
(3.4). The constraint $\langle \mu, 1 \rangle = 1$ is an equality in $\mathbb{R}$, so introduce Lagrange multipliers $\beta \in \mathbb{R}$ for constraint
(3.5). Finally, introduce Lagrange multipliers $\Lambda \in \mathcal{M}([a, b])$ for constraints (3.6). The Lagrange multipliers
$(h, \beta, \Lambda) \in \mathcal{F}_{\hat{w}}(S) \times \mathbb{R} \times \mathcal{M}([a, b])$ will be the decision variables in the upcoming dual to problem
(3.3) - (3.7).

To proceed with duality, we compute the adjoints of $L_0$ and $L_1$. The adjoint is analogous to the transpose
for linear operators in Euclidean spaces.
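
In a finite MDP the analogy with the transpose is literal. The sketch below, with made-up data and hypothetical function names, takes the finite form of $L_0$ and the candidate adjoint $[L_0^* h](s, a) = h(s) - \sum_{t} h(t)\, Q(t \mid s, a)$, and checks the defining identity $\langle h, L_0 \mu \rangle = \langle L_0^* h, \mu \rangle$ numerically.

```python
def apply_L0(mu, Q):
    """Finite L0: marginal mass at each state minus the inflow under Q."""
    n_s, n_a = len(mu), len(mu[0])
    return [sum(mu[b]) - sum(Q[s][a][b] * mu[s][a]
                             for s in range(n_s) for a in range(n_a))
            for b in range(n_s)]


def apply_L0_adjoint(h, Q):
    """Finite adjoint: h(s) minus the expected value of h at the next state."""
    n_s, n_a = len(Q), len(Q[0])
    return [[h[s] - sum(h[t] * Q[s][a][t] for t in range(n_s))
             for a in range(n_a)] for s in range(n_s)]


# Hypothetical data: Q[s][a][t] is the probability of moving from s to t under a.
Q = [[[0.5, 0.5], [0.0, 1.0]],
     [[0.25, 0.75], [0.75, 0.25]]]
mu = [[0.4, 0.1], [0.3, 0.2]]   # any nonnegative weights on state-action pairs
h = [1.0, 3.0]                  # any function on states

lhs = sum(h_b * v for h_b, v in zip(h, apply_L0(mu, Q)))
rhs = sum(g * m
          for g_row, m_row in zip(apply_L0_adjoint(h, Q), mu)
          for g, m in zip(g_row, m_row))
print(lhs, rhs)
```

The two pairings agree up to floating-point error for any choice of `mu` and `h`, which is exactly the transpose relationship between a matrix and its adjoint.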

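Lemma 4.1. (a) The adjoint of $L_0$ is $L_0^* : \mathcal{F}_{\hat{w}}(S) \to \mathcal{F}_w(K)$ where

\[ [L_0^* h](s, a) \triangleq h(s) - \int_S h(\xi)\, Q(d\xi \mid s, a) \]

for all $(s, a) \in K$.

(b) The adjoint of $L_1$ is $L_1^* : \mathcal{M}([a, b]) \to \mathcal{F}_w(K)$ where

\[ [L_1^* \Lambda](s, a) \triangleq \int_a^b (z(s, a) - \eta)_-\, \Lambda(d\eta). \]

Proof. (a) This result is well known, see [26, 27].

(b) Write

\[ \langle \Lambda, L_1 \mu \rangle = \int_a^b \langle \mu, (z - \eta)_- \rangle\, \Lambda(d\eta) = \int_a^b \left[ \int_K (z(s, a) - \eta)_-\, \mu(d(s, a)) \right] \Lambda(d\eta). \]

When $z$ is bounded on $S \times A$, then

\[ \left| \int_{K \times [a, b]} (z(s, a) - \eta)_-\, (\mu \times \Lambda)(d((s, a) \times \eta)) \right|
 = \left| \int_{K \times [a, b]} \frac{(z(s, a) - \eta)_-}{w(s, a)}\, w(s, a)\, (\mu \times \Lambda)(d((s, a) \times \eta)) \right| \]
\[ \leq \|(z - \eta)_-\|_{\mathcal{F}_w(K)}\, \|\mu\|_{\mathcal{M}_w(K)}\, \|\Lambda\|_{\mathcal{M}([a, b])} < \infty, \]

since $\|\mu\|_{\mathcal{M}_w(K)} < \infty$ and $\|\Lambda\|_{\mathcal{M}([a, b])} < \infty$. The Fubini theorem applies to justify interchange of the order of
integration,

\[ \langle \Lambda, L_1 \mu \rangle = \int_a^b \left[ \int_K (z(s, a) - \eta)_-\, \mu(d(s, a)) \right] \Lambda(d\eta) \]
\[ = \int_K \left[ \int_a^b (z(s, a) - \eta)_-\, \Lambda(d\eta) \right] \mu(d(s, a)) \]
\[ = \int_K \langle \Lambda, (z(s, a) - \eta)_- \rangle\, \mu(d(s, a)), \]

revealing $L_1^* : \mathcal{M}([a, b]) \to \mathcal{F}_w(K)$. □

We obtain the dual to problem (3.3) - (3.7) in the next theorem.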
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Theorem 4.2. The dual to problem (TO)) - iTO)) is 

inf /3-{A,y) (4.1) 
s.<. r + L^-/?l + LtA<0, (4.2) 
(ft, 8, A) G Tu, (S) x R x M ([a, &]) , A > 0. (4.3) 

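Proof. The Lagrangian for problem (3.3) - (3.7) is

\[
\vartheta(\mu, h, \beta, \Lambda) \triangleq \langle \mu, r \rangle + \langle h, L_0 \mu \rangle + \beta (\langle \mu, 1 \rangle - 1) + \langle \Lambda, L_1 \mu - y \rangle,
\]

allowing problem (3.3) - (3.7) to be expressed as

\[
\max_{\mu \in \mathcal{M}_w(K)} \left\{ \inf_{(h, \beta, \Lambda) \in \mathcal{F}_w(S) \times \mathbb{R} \times \mathcal{M}([a,b])} \{ \vartheta(\mu, h, \beta, \Lambda) : \Lambda \ge 0 \} : \mu \ge 0 \right\}.
\]

We rearrange the Lagrangian to obtain

\begin{align*}
\vartheta(\mu, h, \beta, \Lambda) &= \langle \mu, r \rangle + \langle h, L_0 \mu \rangle + \beta (\langle \mu, 1 \rangle - 1) + \langle \Lambda, L_1 \mu - y \rangle \\
&= \langle \mu, r \rangle + \langle L_0^* h, \mu \rangle + \langle \mu, \beta 1 \rangle - \beta + \langle L_1^* \Lambda, \mu \rangle - \langle \Lambda, y \rangle \\
&= \langle \mu, r + L_0^* h + \beta 1 + L_1^* \Lambda \rangle - \beta - \langle \Lambda, y \rangle.
\end{align*}

The dual to problem (3.3) - (3.7) is then

\[
\inf_{(h, \beta, \Lambda) \in \mathcal{F}_w(S) \times \mathbb{R} \times \mathcal{M}([a,b])} \left\{ \max_{\mu \in \mathcal{M}_w(K)} \{ \vartheta(\mu, h, \beta, \Lambda) : \mu \ge 0 \} : \Lambda \ge 0 \right\}.
\]

Since $\mu \ge 0$, the constraint $r + L_0^* h + \beta 1 + L_1^* \Lambda \le 0$ is implied. Since $\beta$ is unrestricted, replace $\beta$ by $-\beta$ to get
the desired form. □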
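The rearrangement of the Lagrangian in the proof above can be sanity-checked on a finite model. The following sketch assumes a hypothetical two-state, two-action MDP with invented rewards, performance values $z$, a three-point grid for $\eta$, and arbitrary multipliers; none of these numbers come from the paper. It evaluates the Lagrangian once in its original form and once in the rearranged form $\langle \mu, r + L_0^* h + \beta 1 + L_1^* \Lambda \rangle - \beta - \langle \Lambda, y \rangle$.

```python
# Numerical check of the Lagrangian rearrangement on a tiny finite MDP.
# Every number below is hypothetical and chosen only for illustration.
S, A = [0, 1], [0, 1]
K = [(s, a) for s in S for a in A]
P = {(0, 0): [0.9, 0.1], (0, 1): [0.4, 0.6],
     (1, 0): [0.3, 0.7], (1, 1): [0.8, 0.2]}   # P[(s, a)][j] = Q({j} | s, a)
r = {(0, 0): -1.0, (0, 1): -0.5, (1, 0): -2.0, (1, 1): -0.1}
z = {(0, 0): 0.3, (0, 1): -0.2, (1, 0): 0.8, (1, 1): 0.1}
etas = [-0.5, 0.0, 0.5]
y = [-0.05, -0.2, -0.6]                         # stand-in for E[(Y - eta)_-]

# mu deliberately has total mass 1.1 so that the beta term is exercised.
mu = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.3}
h, beta, lam = [0.7, -1.3], 0.25, [0.5, 0.0, 2.0]

def neg_part(x):
    return min(x, 0.0)

def lagrangian_primal_form():
    reward = sum(mu[k] * r[k] for k in K)
    L0mu = [sum(mu[(j, a)] for a in A) - sum(P[k][j] * mu[k] for k in K) for j in S]
    L1mu = [sum(mu[k] * neg_part(z[k] - eta) for k in K) for eta in etas]
    return (reward
            + sum(h[j] * L0mu[j] for j in S)
            + beta * (sum(mu.values()) - 1.0)
            + sum(l * (v - yy) for l, v, yy in zip(lam, L1mu, y)))

def lagrangian_dual_form():
    total = 0.0
    for (s, a) in K:
        L0h = h[s] - sum(P[(s, a)][j] * h[j] for j in S)        # [L_0^* h](s, a)
        L1lam = sum(l * neg_part(z[(s, a)] - eta) for l, eta in zip(lam, etas))
        total += mu[(s, a)] * (r[(s, a)] + L0h + beta + L1lam)
    return total - beta - sum(l * yy for l, yy in zip(lam, y))

primal_form = lagrangian_primal_form()
dual_form = lagrangian_dual_form()
print(primal_form, dual_form)
```

The two values agree for any choice of $\mu$ and multipliers, because the identity is purely algebraic and does not depend on feasibility.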

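We write problem (4.1) - (4.3) with the infimum objective rather than the minimization objective because
we must verify that the optimal value is attained. After replacing $h$ by $-h$, which is permitted because $h$ is unrestricted in sign, the dual problem (4.1) - (4.3) is explicitly

\begin{align}
\inf \quad & \beta - \int_a^b E[(Y - \eta)_-] \, \Lambda(d\eta) \tag{4.4} \\
\text{s.t.} \quad & r(s,a) + \int_a^b (z(s,a) - \eta)_- \, \Lambda(d\eta) \le \beta + h(s) - \int_S h(\xi) \, Q(d\xi \mid s, a), \quad \forall (s,a) \in K, \tag{4.5} \\
& (h, \beta, \Lambda) \in \mathcal{F}_w(S) \times \mathbb{R} \times \mathcal{M}([a,b]), \quad \Lambda \ge 0. \tag{4.6}
\end{align}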



Since $r \le 0$, problem (4.4) - (4.6) is readily seen to be consistent by choosing $h = 0$, $\beta = 0$, and $\Lambda = 0$.

Problem (4.4) - (4.6) has another, more intuitive form. In [HI HUH!], it is recognized that the Lagrange
multipliers of stochastic dominance constraints are utility functions. This result is true in our case as well.
Using the family $\{ (x - \eta)_- : \eta \in [a,b] \}$, any measure $\Lambda \in \mathcal{M}([a,b])$ with $\Lambda \ge 0$ induces an increasing concave function
in $C([a,b])$ defined by

\[
u(x) = \int_a^b (x - \eta)_- \, \Lambda(d\eta)
\]

for all $x \in \mathbb{R}$. In fact, the above definition of $u$ gives a function in $C(\mathbb{R})$ as well. Define

\begin{align*}
\mathcal{U}([a,b]) &= \operatorname{cl} \operatorname{cone} \{ (x - \eta)_- : \eta \in [a,b] \} \\
&= \left\{ u(x) = \int_a^b (x - \eta)_- \, \Lambda(d\eta) \ \text{for} \ \Lambda \in \mathcal{M}([a,b]), \ \Lambda \ge 0 \right\}
\end{align*}

to be the closure of the cone generated by the family $\{ (x - \eta)_- : \eta \in [a,b] \}$. The set $\mathcal{U}([a,b]) \subset C(\mathbb{R})$ is
the set of all utility functions that can be constructed by limits of sums of scalar multiples of functions in
$\{ (x - \eta)_- : \eta \in [a,b] \}$.
Corollary 4.3. Problem (4.4) - (4.6) is equivalent to

\begin{align}
\inf \quad & \beta - E[u(Y)] \tag{4.7} \\
\text{s.t.} \quad & r(s,a) + u(z(s,a)) \le \beta + h(s) - \int_S h(\xi) \, Q(d\xi \mid s, a), \quad \forall (s,a) \in K, \tag{4.8} \\
& (h, \beta, u) \in \mathcal{F}_w(S) \times \mathbb{R} \times \mathcal{U}([a,b]). \tag{4.9}
\end{align}

Proof. Notice that the function

\[
u(x) = \int_a^b (x - \eta)_- \, \Lambda(d\eta)
\]

is an increasing concave function in $x$ for any $\Lambda \in \mathcal{M}([a,b])$ with $\Lambda \ge 0$. By using this definition of $u$, we
see that for each state-action pair $(s,a)$,

\[
\langle \Lambda, (z(s,a) - \eta)_- \rangle = \int_a^b (z(s,a) - \eta)_- \, \Lambda(d\eta) = u(z(s,a)).
\]

Further, we can apply the Fubini theorem again to obtain

\[
\langle \Lambda, y \rangle = \int_a^b E[(Y - \eta)_-] \, \Lambda(d\eta) = E\left[ \int_a^b (Y - \eta)_- \, \Lambda(d\eta) \right] = E[u(Y)].
\]

□
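The correspondence between nonnegative measures $\Lambda$ and utility functions $u$ is easy to see when $\Lambda$ is discrete. The sketch below is a hedged illustration with invented atoms, weights, and a three-point benchmark $Y$; it builds $u$ from $\Lambda$, checks that $u$ is nondecreasing and concave on a grid, and confirms the identity $\langle \Lambda, y \rangle = E[u(Y)]$ from the proof.

```python
# A discrete nonnegative measure Lambda on [a, b] induces a utility u.
# The atoms, weights, and benchmark distribution are hypothetical.
etas = [-1.0, 0.0, 0.5, 2.0]      # atoms of Lambda in [a, b] = [-1, 2]
weights = [0.5, 1.0, 0.25, 2.0]   # Lambda({eta}) >= 0

def u(x):
    """u(x) = sum_eta Lambda({eta}) * min(x - eta, 0)."""
    return sum(w * min(x - eta, 0.0) for w, eta in zip(weights, etas))

# Discrete benchmark Y: values and probabilities.
y_vals = [-0.5, 0.3, 1.5]
y_prob = [0.2, 0.5, 0.3]

# <Lambda, y> with y(eta) = E[(Y - eta)_-]
lam_y = sum(w * sum(p * min(v - eta, 0.0) for v, p in zip(y_vals, y_prob))
            for w, eta in zip(weights, etas))

# E[u(Y)]
exp_u = sum(p * u(v) for v, p in zip(y_vals, y_prob))

# Increments of u on a uniform grid from -2 to 3: nonnegative (increasing)
# and nonincreasing (concave).
grid = [-2.0 + 0.25 * i for i in range(21)]
values = [u(x) for x in grid]
increments = [b - a for a, b in zip(values, values[1:])]
print(lam_y, exp_u)
```

With a discrete $\Lambda$ the function $u$ is piecewise linear with breakpoints at the atoms, which is the form used later for finite state and action spaces.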



Next we verify that there is no duality gap between the primal problem (3.3) - (3.7) and its dual (4.1)
- (4.3). All three dual problems (4.1) - (4.3), (4.4) - (4.6), and (4.7) - (4.9) are equivalent, so the upcoming
results apply to all of them.

The following result states that the optimal values of problems (3.3) - (3.7) and (4.1) - (4.3) are equal.
Afterwards, we will show that the optimal value of problem (4.1) - (4.3) is attained, establishing strong
duality.

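Theorem 4.4. The optimal values of problems (3.3) - (3.7) and (4.1) - (4.3) are equal,

\begin{align*}
\rho^* &= \max \{ R(\pi, \nu) : (\pi, \nu) \in \Delta \} \\
&= \inf \{ \beta - \langle \Lambda, y \rangle : (4.2), \ (h, \beta, \Lambda) \in \mathcal{F}_w(S) \times \mathbb{R} \times \mathcal{M}([a,b]), \ \Lambda \ge 0 \}.
\end{align*}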

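Proof. Apply [27, Theorem 12.3.4], which in turn follows from O Theorem 3.9]. Introduce slack variables
$\alpha \in C([a,b])$ for the dominance constraints $L_1 \mu \ge y$. We must show that the set

\[
H \triangleq \{ (L_0 \mu, \langle \mu, 1 \rangle, L_1 \mu - \alpha, \langle \mu, r \rangle - \zeta) : \mu \ge 0, \ \alpha \ge 0, \ \zeta \ge 0 \}
\]

is weakly closed (closed in the weak topology). Let $(D, \le)$ be a directed (partially ordered) set, and consider
a net

\[
\{ (\mu_\kappa, \alpha_\kappa, \zeta_\kappa) : \kappa \in D \}
\]

where $\mu_\kappa \ge 0$, $\alpha_\kappa \ge 0$, and $\zeta_\kappa \ge 0$ in $\mathcal{M}_w(K) \times C([a,b]) \times \mathbb{R}$ such that

\[
(L_0 \mu_\kappa, \langle \mu_\kappa, 1 \rangle, L_1 \mu_\kappa - \alpha_\kappa, \langle \mu_\kappa, r \rangle - \zeta_\kappa)
\]

has weak limit $(\nu^*, \gamma^*, f^*, \rho^*) \in \mathcal{M}_w(S) \times \mathbb{R} \times C([a,b]) \times \mathbb{R}$. Specifically,

\[
\langle \mu_\kappa, 1 \rangle \to \gamma^* \quad \text{and} \quad \langle \mu_\kappa, r \rangle - \zeta_\kappa \to \rho^*
\]

since weak convergence on $\mathbb{R}$ is equivalent to the usual notion of convergence,

\[
\langle L_0 \mu_\kappa, g \rangle \to \langle \nu^*, g \rangle
\]

for all $g \in \mathcal{F}_w(S)$, and

\[
\langle L_1 \mu_\kappa - \alpha_\kappa, \Lambda \rangle \to \langle f^*, \Lambda \rangle
\]

for all $\Lambda \in \mathcal{M}([a,b])$. We must show that $(\nu^*, \gamma^*, f^*, \rho^*) \in H$ under these conditions, i.e. that there exist
$\mu \ge 0$, $\alpha \ge 0$, and $\zeta \ge 0$ such that

\[
\nu^* = L_0 \mu, \quad \gamma^* = \langle \mu, 1 \rangle, \quad f^* = L_1 \mu - \alpha, \quad \rho^* = \langle \mu, r \rangle - \zeta.
\]

The fact that there exist $\mu \ge 0$ and $\zeta \ge 0$ such that

\[
\nu^* = L_0 \mu, \quad \gamma^* = \langle \mu, 1 \rangle, \quad \rho^* = \langle \mu, r \rangle - \zeta
\]

is already established in [27, Theorem 12.3.4], and applies to our setting without modification.

It remains to verify that there exists $\alpha \in C([a,b])$ with $\alpha \ge 0$ and $f^* = L_1 \mu - \alpha$. Choose $\Lambda = \delta_\eta$ for the
Dirac delta function at $\eta \in [a,b]$ to see that

\[
[L_1 \mu_\kappa](\eta) - \alpha_\kappa(\eta) \to f^*(\eta)
\]

for all $\eta \in [a,b]$, establishing pointwise convergence. Pointwise convergence on a compact set implies uniform
convergence, so in fact

\[
L_1 \mu_\kappa - \alpha_\kappa \to f^*
\]

in the supremum norm topology on $C([a,b])$. Since $L_1 \mu_\kappa \in C([a,b])$ and $f^* \in C([a,b])$, it follows that
$L_1 \mu_\kappa - f^* \in C([a,b])$ for any $\kappa$. Define $\alpha_\kappa = L_1 \mu_\kappa - f^*$ and $\alpha = L_1 \mu - f^*$, and notice that $\alpha \ge 0$
necessarily. □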

The next theorem shows that the dual problem (4.1) - (4.3) is solvable, i.e. there exists $(h^*, \beta^*, \Lambda^*)$
satisfying $r + L_0^* h^* - \beta^* 1 + L_1^* \Lambda^* \le 0$ that attains the optimal value

\[
\beta^* - \langle \Lambda^*, y \rangle = \rho^*.
\]

When problem (4.1) - (4.3) is solvable, we are justified in saying that strong duality holds: the optimal values
of both problems (3.3) - (3.7) and (4.1) - (4.3) are equal and both problems attain their optimal value.
To continue we make some assumptions in line with [23].

Assumption 4.5. There exists a minimizing sequence $(h^n, \beta^n, \Lambda^n)$ in problem (4.1) - (4.3) such that
(a) $\{ \beta^n \}$ is bounded in $\mathbb{R}$,
(b) $\{ h^n \}$ is bounded in $\mathcal{F}_w(S)$, and
(c) $\{ \Lambda^n \}$ is bounded in the weak* topology on $\mathcal{M}([a,b])$.

We establish strong duality next. To reiterate, strong duality holds when the optimal values of problems
(3.3) - (3.7) and (4.1) - (4.3) are equal, and both problems are solvable.



Theorem 4.6. Suppose Assumption 4.5 holds. Strong duality holds between problem (3.3) - (3.7) and problem
(4.1) - (4.3).

Proof. Let $(h^n, \beta^n, \Lambda^n) \in \mathcal{F}_w(S) \times \mathbb{R} \times \mathcal{M}([a,b])$ for $n \ge 0$ be a minimizing sequence of triples given in the
preceding Assumption 4.5, so that

\[
r(s,a) + \int_a^b (z(s,a) - \eta)_- \, \Lambda^n(d\eta) \le \beta^n + h^n(s) - \int_S h^n(\xi) \, Q(d\xi \mid s, a), \quad \forall (s,a) \in K,
\]

for all $n \ge 0$ and

\[
\beta^n - \int_a^b E[(Y - \eta)_-] \, \Lambda^n(d\eta) \downarrow \rho^*.
\]

Since the sequence $\{ \beta^n \}$ is bounded, it has a convergent subsequence with $\lim_{n \to \infty} \beta^n = \beta^*$.
Now $\{ \Lambda^n \}$ is bounded in $\mathcal{M}([a,b])$ in the weak* topology induced by $C([a,b])$ by assumption. Since
$\{ \Lambda^n \}$ is bounded, the sequence can be scaled to lie in the closed unit ball of $\mathcal{M}([a,b])$ in the weak* topology.
Since $C([a,b])$ is separable (there exists a countable dense set, i.e. the polynomials with rational coefficients),
the weak* topology on $\mathcal{M}([a,b])$ is metrizable. By the Banach-Alaoglu theorem, it follows that $\{ \Lambda^n \}$ has a
subsequence that converges to some $\Lambda^*$ in the weak* topology, i.e.

\[
\langle \Lambda^n, f \rangle \to \langle \Lambda^*, f \rangle
\]

for all $f \in C([a,b])$. In particular, since $E[(Y - \eta)_-]$ and $(z(s,a) - \eta)_-$ are continuous functions of $\eta$ on $[a,b]$
for all $(s,a) \in K$, it follows that

\[
\lim_{n \to \infty} \int_a^b E[(Y - \eta)_-] \, \Lambda^n(d\eta) = \int_a^b E[(Y - \eta)_-] \, \Lambda^*(d\eta)
\]

and

\[
\lim_{n \to \infty} \int_a^b (z(s,a) - \eta)_- \, \Lambda^n(d\eta) = \int_a^b (z(s,a) - \eta)_- \, \Lambda^*(d\eta).
\]

Finally, since $\{ h^n \}$ is bounded in $\mathcal{F}_w(S)$ we can define

\[
h^*(s) = \liminf_{n \to \infty} h^n(s)
\]

for all $s \in S$. Then the function $h^*$ is bounded in $\mathcal{F}_w(S)$, and

\[
\liminf_{n \to \infty} \int_S h^n(\xi) \, Q(d\xi \mid s, a) \ge \int_S h^*(\xi) \, Q(d\xi \mid s, a)
\]

by Fatou's lemma. Taking the limit, it follows that $(h^*, \beta^*, \Lambda^*)$ is an optimal solution to the dual problem. □

The role of the utility function $u$ in problem (4.7) - (4.9) is fairly intuitive. The function $u$ serves as an
additional pricing variable for the performance function $z(s,a)$, and the total reward is treated as if it were
$r(s,a) + u(z(s,a))$. Problem (4.7) - (4.9) leads to a new version of the optimality equations for average
reward based on infinite-dimensional linear programming complementary slackness.

Theorem 4.7. Let $\mu^* = \hat{\mu}^* \cdot \varphi^*$ be an optimal solution to problem (3.3) - (3.7), and $(h^*, \beta^*, u^*)$ be an
optimal solution to problem (4.7) - (4.9). Then

\[
\langle \mu^*, u^*(z) \rangle = E[u^*(Y)],
\]

and

\[
\beta^* + h^*(s) = \sup_{a \in A(s)} \left\{ r(s,a) + u^*(z(s,a)) + \int_S h^*(\xi) \, Q(d\xi \mid s, a) \right\}
\]

for $\hat{\mu}^*$-almost all $s \in S$.

Proof. There is a corresponding optimal solution $(h^*, \beta^*, \Lambda^*)$ to problem (4.1) - (4.3), with $u^*(x) = \int_a^b (x - \eta)_- \, \Lambda^*(d\eta)$. Complementary
slackness between problems (3.3) - (3.7) and (4.1) - (4.3) gives $\langle \Lambda^*, L_1 \mu^* - y \rangle = 0$. Then

\[
\langle \Lambda^*, L_1 \mu^* \rangle = \langle L_1^* \Lambda^*, \mu^* \rangle = \langle \mu^*, u^*(z) \rangle
\]

and $\langle \Lambda^*, y \rangle = E[u^*(Y)]$.

Complementary slackness also gives

\[
\langle r + L_0^* h^* - \beta^* 1 + L_1^* \Lambda^*, \mu^* \rangle = 0,
\]

which yields the second statement since $\mu^* \ge 0$ and $r + L_0^* h^* - \beta^* 1 + L_1^* \Lambda^* \le 0$. □
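For a finite model, optimality equations of this form can be solved numerically once a utility $u$ is fixed. The sketch below is only an illustration under strong assumptions: the two-state MDP, the rewards, the values of $z$, and the utility $u(x) = 2 \min(x - 0.5, 0)$ are all hypothetical, and $u$ is simply chosen by hand rather than obtained as an optimal dual multiplier. It runs relative value iteration, a standard method that is not taken from the paper, on the modified reward $r + u(z)$.

```python
# Relative value iteration for the modified average-reward optimality equation
#   beta + h(s) = max_a { r(s,a) + u(z(s,a)) + sum_j P(j|s,a) h(j) }.
# The MDP data and the utility u are hypothetical.
S, A = [0, 1], [0, 1]
P = {(0, 0): [0.9, 0.1], (0, 1): [0.4, 0.6],
     (1, 0): [0.3, 0.7], (1, 1): [0.8, 0.2]}   # P[(s, a)][j]
r = {(0, 0): -1.0, (0, 1): -0.5, (1, 0): -2.0, (1, 1): -0.1}
z = {(0, 0): 0.3, (0, 1): -0.2, (1, 0): 0.8, (1, 1): 0.1}

def u(x):
    """A fixed increasing concave utility, u(x) = 2 * min(x - 0.5, 0)."""
    return 2.0 * min(x - 0.5, 0.0)

def bellman(h):
    """Apply the right-hand side of the optimality equation to h."""
    return [max(r[(s, a)] + u(z[(s, a)]) + sum(P[(s, a)][j] * h[j] for j in S)
                for a in A)
            for s in S]

h, beta = [0.0, 0.0], 0.0
for _ in range(500):
    Th = bellman(h)
    beta = Th[0]                   # gain estimate, read off at reference state 0
    h = [v - Th[0] for v in Th]    # renormalize so that h[0] = 0

Th = bellman(h)
residual = max(abs(beta + h[s] - Th[s]) for s in S)
print(beta, h, residual)
```

Because every transition probability in this toy model is positive, the iteration contracts in the span seminorm and the residual of the optimality equation becomes negligible after a few hundred sweeps.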
5 Variations and extensions



5.1 Multivariate integral stochastic orders 

We extend our repertoire in this section to include some additional stochastic orders. Integral stochastic 
orders (see [33]) refer to stochastic orders that are defined in terms of families of functions. The increasing 
concave stochastic order is an example of an integral stochastic order, because it is defined in terms of 
the family of increasing concave functions. We now give attention to some multivariate integral stochastic 
orders. So far, we have considered a $z : K \to \mathbb{R}$ that is a scalar-valued function. In practice there are usually
many system performance measures of interest, so it is logical to consider vector-valued $z : K \to \mathbb{R}^n$ as well.
For example, $z(s,a)$ may represent the service rate to $n$ customers in a wireless network. The empirical
distribution $\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} z(s_t, a_t)$ is now a vector-valued random variable on $\mathbb{R}^n$.

Recall the multivariate increasing concave stochastic order. For random vectors $X, Y \in \mathbb{R}^n$, $X$ dominates
$Y$ in the increasing concave stochastic order, written $X \ge_{icv} Y$, if $E[u(X)] \ge E[u(Y)]$ for all
increasing concave functions $u : \mathbb{R}^n \to \mathbb{R}$ such that both expectations exist. Unlike univariate $\ge_{icv}$, there
is no parametrized family of functions (like $(x - \eta)_-$) that generates all the multivariate increasing concave
functions. This result rests on the fact that the set of extreme points of the increasing concave functions on
$\mathbb{R}^n$ to $\mathbb{R}$ is dense for $n \ge 2$, see [28, 17].

As in [13], we can relax the condition $X \ge_{icv} Y$ by constructing a tractable parametrized family of
increasing concave functions. Let $u(\cdot; \xi) : \mathbb{R}^n \to \mathbb{R}$ represent a family of increasing concave functions
parametrized by $\xi \in \Xi \subset \mathbb{R}^p$ where $\Xi$ is compact. Then, the family of functions $\{ u(\cdot; \xi) \}_{\xi \in \Xi}$ is a subset of
all increasing concave functions and leads to a relaxation of $\ge_{icv}$. We say $X$ dominates $Y$ with respect to
the integral stochastic order generated by $\{ u(\cdot; \xi) \}_{\xi \in \Xi}$ if $E[u(X; \xi)] \ge E[u(Y; \xi)]$ for all $\xi \in \Xi$. Define

\[
Z_\xi(\pi, \nu) \triangleq \liminf_{T \to \infty} \frac{1}{T} E_\nu^\pi \left[ \sum_{t=0}^{T-1} u(z(s_t, a_t); \xi) \right]
\]

for all $\xi \in \Xi$. For convenience, we assume $u(x; \xi)$ is continuous in $\xi \in \Xi$ for any $x \in \mathbb{R}^n$.
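To make the relaxed order concrete, the sketch below checks dominance with respect to one possible parametrized family on a finite grid of parameters. The family $u(x; \xi) = \min(\langle w, x \rangle - \eta, 0)$ with $\xi = (w, \eta)$ and $w \ge 0$ is an assumption made for this example (it is increasing and concave in $x$, but it is not claimed to be the family used in [13]), and the two discrete random vectors are invented.

```python
# Dominance check for a hypothetical parametrized family of increasing
# concave functions on R^2: u(x; xi) = min(<w, x> - eta, 0), xi = (w1, w2, eta).
def u(x, xi):
    w1, w2, eta = xi
    return min(w1 * x[0] + w2 * x[1] - eta, 0.0)

def expect(dist, xi):
    """E[u(X; xi)] for a discrete distribution given as (point, probability) pairs."""
    return sum(p * u(x, xi) for x, p in dist)

# Two invented random vectors; X is componentwise at least as large as Y
# under the obvious pairing of outcomes.
X = [((1.0, 2.0), 0.5), ((2.0, 1.0), 0.5)]
Y = [((0.0, 2.0), 0.5), ((2.0, 0.0), 0.5)]

# Finite grid standing in for the compact parameter set.
grid = [(w1, 1.0 - w1, eta)
        for w1 in (0.0, 0.25, 0.5, 0.75, 1.0)
        for eta in (0.0, 0.5, 1.0, 1.5, 2.0)]

def dominates(D1, D2, grid, tol=1e-12):
    """True if E[u(D1; xi)] >= E[u(D2; xi)] for every xi on the grid."""
    return all(expect(D1, xi) >= expect(D2, xi) - tol for xi in grid)

x_dominates_y = dominates(X, Y, grid)
y_dominates_x = dominates(Y, X, grid)
print(x_dominates_y, y_dominates_x)
```

Checking on a grid only certifies the relaxed order at the sampled parameters; continuity of $u(x; \xi)$ in $\xi$ is what makes a fine grid a reasonable proxy for the whole parameter set.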
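We propose the multivariate dominance-constrained MDP:

\begin{align}
\sup \quad & R(\pi, \nu) \tag{5.1} \\
\text{s.t.} \quad & Z_\xi(\pi, \nu) \ge E[u(Y; \xi)], \quad \forall \xi \in \Xi, \tag{5.2} \\
& \pi \in \Pi, \tag{5.3}
\end{align}

using $\{ u(\cdot; \xi) \}_{\xi \in \Xi}$.

By the same reasoning as earlier,

\[
Z_\xi(\varphi, \hat{\mu}) = \langle \mu, u(z(s,a); \xi) \rangle = \int_S \left[ \int_A u(z(s,a); \xi) \, \varphi(da \mid s) \right] \hat{\mu}(ds)
\]

for all $\xi \in \Xi$ when $\mu = \hat{\mu} \cdot \varphi \in \Delta_s$.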



Lemma 5.1. For any $\mu \in \mathcal{M}(K)$, $\langle \mu, u(z; \xi) \rangle$ is continuous in $\xi$.

Proof. Write $\langle \mu, u(z; \xi) \rangle = \int_K u(z(s,a); \xi) \, \mu(d(s,a))$. Certainly each function $u(z(s,a); \xi)$ is continuous
in $\xi$ for any fixed $(s,a)$. Since $\mu$ is finite, it follows that the integral of $u(z(s,a); \xi)$ with respect to $\mu$ is
continuous in $\xi$. □

Let $C(\Xi)$ be the space of continuous functions on $\Xi$ in the supremum norm,

\[
\| f \|_{C(\Xi)} = \sup_{\xi \in \Xi} | f(\xi) |.
\]

We will express the dominance constraints (5.2) as a linear operator in $C(\Xi)$. This operator depends on the
parametrization $u(\cdot; \xi)$. The preceding lemma justifies defining $L_1 : \mathcal{M}(S \times A) \to C(\Xi)$ by

\[
[L_1 \mu](\xi) \triangleq \langle \mu, u(z; \xi) \rangle, \quad \forall \xi \in \Xi. \tag{5.4}
\]

Also define the continuous function $y \in C(\Xi)$ by $y(\xi) = E[u(Y; \xi)]$ for all $\xi \in \Xi$ to represent the benchmark.
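The steady-state version of problem (5.1) - (5.3) is the modified linear program:

\begin{align}
\max \quad & \langle \mu, r \rangle \tag{5.5} \\
\text{s.t.} \quad & L_0 \mu = 0, \tag{5.6} \\
& \langle \mu, 1 \rangle = 1, \tag{5.7} \\
& L_1 \mu \ge y, \tag{5.8} \\
& \mu \in \mathcal{M}_w(K), \quad \mu \ge 0. \tag{5.9}
\end{align}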



Problem (5.5) - (5.9) is almost the same as problem (3.3) - (3.7), except that now $L_1 \mu$ is an element in $C(\Xi)$
to reflect the multivariate dominance constraint.

We now compute the adjoint of $L_1$, which depends on the choice of family $\{ u(\cdot; \xi) : \xi \in \Xi \}$. The
parametrization $u(\cdot; \xi)$ will appear explicitly in this computation.

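Lemma 5.2. The adjoint of $L_1$ is $L_1^* : \mathcal{M}(\Xi) \to \mathcal{F}_w(K)$ where

\[
[L_1^* \Lambda](s,a) \triangleq \int_\Xi u(z(s,a); \xi) \, \Lambda(d\xi).
\]

Proof. Write

\[
\langle \Lambda, L_1 \mu \rangle = \int_\Xi \langle \mu, u(z; \xi) \rangle \, \Lambda(d\xi)
= \int_\Xi \left[ \int_K u(z(s,a); \xi) \, \mu(d(s,a)) \right] \Lambda(d\xi).
\]

When $z$ is bounded on $S \times A$, then

\[
\left| \int_{K \times \Xi} u(z; \xi) \, (\mu \times \Lambda)(d((s,a) \times \xi)) \right| \le \| u(z(\cdot); \xi) \|_{\mathcal{F}_w(K)} \, \| \mu \|_{\mathcal{M}_w(K)} \, \| \Lambda \|_{\mathcal{M}(\Xi)} < \infty.
\]

The Fubini theorem applies to justify interchange of the order of integration,

\[
\langle \Lambda, L_1 \mu \rangle = \int_K \left[ \int_\Xi u(z(s,a); \xi) \, \Lambda(d\xi) \right] \mu(d(s,a))
= \int_K \langle \Lambda, u(z(s,a); \xi) \rangle \, \mu(d(s,a)).
\]

□

The dual to problem (5.5) - (5.9) looks identical to problem (4.1) - (4.3) and is now explicitly

\begin{align}
\inf \quad & \beta - \int_\Xi E[u(Y; \xi)] \, \Lambda(d\xi) \tag{5.10} \\
\text{s.t.} \quad & r(s,a) + \int_\Xi u(z(s,a); \xi) \, \Lambda(d\xi) \le \beta + h(s) - \int_S h(\xi) \, Q(d\xi \mid s, a), \quad \forall (s,a) \in K, \tag{5.11} \\
& (h, \beta, \Lambda) \in \mathcal{F}_w(S) \times \mathbb{R} \times \mathcal{M}(\Xi), \quad \Lambda \ge 0. \tag{5.12}
\end{align}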



Define

\begin{align*}
\mathcal{U}(\Xi) &= \operatorname{cl} \operatorname{cone} \{ u(x; \xi) : \xi \in \Xi \} \\
&= \left\{ u(x) = \int_\Xi u(x; \xi) \, \Lambda(d\xi) \ \text{for} \ \Lambda \in \mathcal{M}(\Xi), \ \Lambda \ge 0 \right\}
\end{align*}

to be the closure of the cone of functions generated by $\{ u(x; \xi) : \xi \in \Xi \}$. In this case $\mathcal{U}(\Xi)$ is a family of
functions in $C(\mathbb{R}^n)$, the space of continuous functions $f : \mathbb{R}^n \to \mathbb{R}$. We see immediately that problem (5.10)
- (5.12) is equivalent to

\begin{align}
\inf \quad & \beta - E[u(Y)] \tag{5.13} \\
\text{s.t.} \quad & r(s,a) + u(z(s,a)) \le \beta + h(s) - \int_S h(\xi) \, Q(d\xi \mid s, a), \quad \forall (s,a) \in K, \tag{5.14} \\
& (h, \beta, u) \in \mathcal{F}_w(S) \times \mathbb{R} \times \mathcal{U}(\Xi). \tag{5.15}
\end{align}

The variables $u \in \mathcal{U}(\Xi)$ in problem (5.13) - (5.15) are now pricing variables for the vector $z$. When our earlier
assumptions are suitably adapted, strong duality holds between problem (5.5) - (5.9) and problem (5.13)
- (5.15).

Theorem 5.3. The optimal values of problems (5.5) - (5.9) and (5.10) - (5.12) are equal. Further, the
dual problem (5.10) - (5.12) is solvable and strong duality holds between problems (5.5) - (5.9) and (5.10) - (5.12).

5.2 Discounted reward 

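We briefly sketch the development for discounted reward; it is mostly similar. Discounted cost MDPs in
Borel spaces with finitely many constraints are considered in [22]. Introduce the discount factor $\delta \in (0,1)$
and consider the long-run expected discounted reward

\[
R(\pi, \nu) \triangleq E_\nu^\pi \left[ \sum_{t=0}^{\infty} \delta^t r(s_t, a_t) \right].
\]

We are interested in the distribution of discounted reward $z$,

\[
\sum_{t=0}^{\infty} \delta^t z(s_t, a_t).
\]

Define

\[
Z_\eta(\pi, \nu) \triangleq E_\nu^\pi \left[ \sum_{t=0}^{\infty} \delta^t (z(s_t, a_t) - \eta)_- \right].
\]

We propose the dominance-constrained MDP:

\begin{align}
\sup \quad & R(\pi, \nu) \tag{5.16} \\
\text{s.t.} \quad & Z_\eta(\pi, \nu) \ge E[(Y - \eta)_-], \quad \forall \eta \in [a,b], \tag{5.17} \\
& \pi \in \Pi. \tag{5.18}
\end{align}

We work with the $\delta$-discounted expected occupation measure

\[
\mu_\nu^\pi(\Gamma) \triangleq \sum_{t=0}^{\infty} \delta^t P_\nu^\pi((s_t, a_t) \in \Gamma)
\]

for all $\Gamma \in \mathcal{B}(S \times A)$. Now let

\[
[L_0 \mu](B) \triangleq \hat{\mu}(B) - \delta \int_{S \times A} Q(B \mid s, a) \, \mu(d(s,a)), \quad \forall B \in \mathcal{B}(S), \tag{5.19}
\]

and

\[
[L_1 \mu](\eta) \triangleq \langle \mu, (z - \eta)_- \rangle, \quad \forall \eta \in [a,b]. \tag{5.20}
\]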
Also continue to define $y \in C([a,b])$ by $y(\eta) = E[(Y - \eta)_-]$ for all $\eta \in [a,b]$. Problem (5.16) - (5.18) is then
equivalent to the linear program

\begin{align}
\max \quad & \langle \mu, r \rangle \tag{5.21} \\
\text{s.t.} \quad & L_0 \mu = \nu, \tag{5.22} \\
& L_1 \mu \ge y, \tag{5.23} \\
& \mu \in \mathcal{M}_w(K), \quad \mu \ge 0. \tag{5.24}
\end{align}

Introduce Lagrange multipliers $h \in \mathcal{F}_w(S)$ for constraint $L_0 \mu = \nu$ and multipliers $\Lambda \in \mathcal{M}([a,b])$ for
constraint $L_1 \mu \ge y$; the Lagrangian is then

\[
\vartheta(\mu, h, \Lambda) = \langle \mu, r \rangle + \langle h, L_0 \mu - \nu \rangle + \langle \Lambda, L_1 \mu - y \rangle.
\]

The adjoint of $L_0$ is $L_0^* : \mathcal{F}_w(S) \to \mathcal{F}_w(S \times A)$ defined by

\[
[L_0^* h](s,a) \triangleq h(s) - \delta \int_S h(\xi) \, Q(d\xi \mid s, a).
\]

The adjoint of $L_1$ is still $L_1^* : \mathcal{M}([a,b]) \to \mathcal{F}_w(S \times A)$ where

\[
[L_1^* \Lambda](s,a) = \int_a^b (z(s,a) - \eta)_- \, \Lambda(d\eta).
\]

The form of the dual follows.

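Theorem 5.4. The dual to problem (5.21) - (5.24) is

\begin{align}
\min \quad & \langle h, \nu \rangle - \langle \Lambda, y \rangle \tag{5.25} \\
\text{s.t.} \quad & r - L_0^* h + L_1^* \Lambda \le 0, \tag{5.26} \\
& h \in \mathcal{F}_w(S), \quad \Lambda \in \mathcal{M}([a,b]), \quad \Lambda \ge 0. \tag{5.27}
\end{align}

The optimal values of problems (5.21) - (5.24) and (5.25) - (5.27) are equal, and problem (5.25) - (5.27) is
solvable.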

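This dual is explicitly

\begin{align}
\min \quad & \langle h, \nu \rangle - E[u(Y)] \tag{5.28} \\
\text{s.t.} \quad & r(s,a) + u(z(s,a)) \le h(s) - \delta \int_S h(\xi) \, Q(d\xi \mid s, a), \quad \forall (s,a) \in K, \tag{5.29} \\
& h \in \mathcal{F}_w(S), \quad u \in \mathcal{U}([a,b]). \tag{5.30}
\end{align}

Problem (5.28) - (5.30) leads to a modified set of optimality equations for the infinite horizon discounted
reward case, namely

\[
h(s) = \max_{a \in A(s)} \left\{ r(s,a) + u(z(s,a)) + \delta \int_S h(\xi) \, Q(d\xi \mid s, a) \right\}
\]

for all $s \in S$.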
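With a fixed utility, the discounted optimality equations above reduce to an ordinary Bellman equation for the reward $r + u(z)$, so standard value iteration applies. The following sketch assumes a hypothetical two-state MDP, a discount factor of 0.9, and a hand-picked utility $u(x) = 2 \min(x - 0.5, 0)$; in the dual problem $u$ would itself be a decision variable, which this example does not attempt to optimize.

```python
# Value iteration for h(s) = max_a { r(s,a) + u(z(s,a)) + delta * sum_j P(j|s,a) h(j) }.
# All model data and the utility u are hypothetical.
S, A = [0, 1], [0, 1]
P = {(0, 0): [0.9, 0.1], (0, 1): [0.4, 0.6],
     (1, 0): [0.3, 0.7], (1, 1): [0.8, 0.2]}   # P[(s, a)][j]
r = {(0, 0): -1.0, (0, 1): -0.5, (1, 0): -2.0, (1, 1): -0.1}
z = {(0, 0): 0.3, (0, 1): -0.2, (1, 0): 0.8, (1, 1): 0.1}
delta = 0.9

def u(x):
    """A fixed increasing concave utility (a pricing term for z)."""
    return 2.0 * min(x - 0.5, 0.0)

def q_value(h, s, a):
    """Modified one-step reward plus discounted continuation value."""
    return r[(s, a)] + u(z[(s, a)]) + delta * sum(P[(s, a)][j] * h[j] for j in S)

def bellman(h):
    return [max(q_value(h, s, a) for a in A) for s in S]

h = [0.0, 0.0]
for _ in range(500):          # the operator is a delta-contraction in the sup norm
    h = bellman(h)

residual = max(abs(a - b) for a, b in zip(h, bellman(h)))
greedy = [max(A, key=lambda a, s=s: q_value(h, s, a)) for s in S]
print(h, greedy, residual)
```

The greedy actions recovered at the end show how the pricing term changes the policy: an action with a good reward can be penalized when its performance $z(s,a)$ falls below the breakpoint of $u$.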



5.3 Approximate linear programming 

Various approaches have been put forward for solving infinite-dimensional LPs with sequences of finite-dimensional
LPs, such as in [24, 31]. Approximate linear programming (ALP) has been put forward as
an approach to the curse of dimensionality, and it can be applied to our present setting. The average
reward linear program (3.3) - (3.7) and the discounted reward linear program (5.21) - (5.24) generally have
uncountably many variables and constraints.

ALP for average cost dynamic programming is developed in [8]. Previous work on ALP for dynamic
programming has focused on approximating the cost-to-go function $h$ rather than the steady-state occupation
measure $\mu$. It is more intuitive to design basis functions for the cost-to-go function than the occupation
measure. For problem (3.3) - (3.7), we approximate the cost-to-go function $h \in \mathcal{F}_w(S)$ with the basis
functions $\{ \phi_1, \ldots, \phi_m \} \subset \mathcal{F}_w(S)$. We approximate the pricing variable $u \in \mathcal{U}([a,b])$ with basis functions
$\{ u_1, \ldots, u_n \} \subset \mathcal{U}([a,b])$. The resulting approximate linear program is

\begin{align}
\min \quad & \beta - E\left[ \sum_{i=1}^{n} \alpha_i u_i(Y) \right] \tag{5.31} \\
\text{s.t.} \quad & r(s,a) + \sum_{i=1}^{n} \alpha_i u_i(z(s,a)) \le \beta + \sum_{j=1}^{m} \gamma_j \phi_j(s) - \int_S \sum_{j=1}^{m} \gamma_j \phi_j(\xi) \, Q(d\xi \mid s, a), \quad \forall (s,a) \in K, \tag{5.32} \\
& (\gamma, \beta, \alpha) \in \mathbb{R}^m \times \mathbb{R} \times \mathbb{R}^n, \quad \alpha \ge 0. \tag{5.33}
\end{align}

We are justified in writing minimization instead of infimum in problem (5.31) - (5.33) because there are only
finitely many decision variables. ALP has been studied extensively for the linear programming representation
of the optimality equations for discounted infinite horizon dynamic programming (see [16, 17, 14]). The
discounted approximate linear program is

\begin{align}
\min \quad & \left\langle \sum_{j=1}^{m} \gamma_j \phi_j, \nu \right\rangle - E\left[ \sum_{i=1}^{n} \alpha_i u_i(Y) \right] \tag{5.34} \\
\text{s.t.} \quad & r(s,a) + \sum_{i=1}^{n} \alpha_i u_i(z(s,a)) \le \sum_{j=1}^{m} \gamma_j \phi_j(s) - \delta \int_S \sum_{j=1}^{m} \gamma_j \phi_j(\xi) \, Q(d\xi \mid s, a), \quad \forall (s,a) \in K, \tag{5.35} \\
& (\gamma, \alpha) \in \mathbb{R}^m \times \mathbb{R}^n, \quad \alpha \ge 0. \tag{5.36}
\end{align}

Both problems (5.31) - (5.33) and (5.34) - (5.36) are restrictions of the corresponding problems (4.7) - (4.9)
and (5.28) - (5.30).

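Problems (5.31) - (5.33) and (5.34) - (5.36) have a manageable number of decision variables but an
intractable number of constraints. Constraint sampling has been a prominent tool in ALP, and we cite a
relevant result now. Let

\[
\langle \gamma_z, r \rangle + \kappa_z \ge 0, \quad \forall z \in \mathcal{L}, \tag{5.37}
\]

be a set of linear inequalities in the variables $r \in \mathbb{R}^k$ indexed by an arbitrary set $\mathcal{L}$. Let $\psi$ be a probability
distribution on $\mathcal{L}$; we would like to take i.i.d. samples from $\mathcal{L}$ to construct a set $\mathcal{W} \subset \mathcal{L}$ with

\[
\sup_{\{ r \,:\, \langle \gamma_z, r \rangle + \kappa_z \ge 0, \ \forall z \in \mathcal{W} \}} \psi(\{ y : \langle \gamma_y, r \rangle + \kappa_y < 0 \}) \le \epsilon.
\]

Theorem 5.5. fF^ Theorem 2.1] For any $\delta \in (0,1)$ and $\epsilon \in (0,1)$, and

\[
m \ge \frac{4}{\epsilon} \left( k \ln \frac{12}{\epsilon} + \ln \frac{2}{\delta} \right),
\]

a set $\mathcal{W}$ of $m$ i.i.d. samples drawn from $\mathcal{L}$ according to distribution $\psi$ satisfies

\[
\sup_{\{ r \,:\, \langle \gamma_z, r \rangle + \kappa_z \ge 0, \ \forall z \in \mathcal{W} \}} \psi(\{ y : \langle \gamma_y, r \rangle + \kappa_y < 0 \}) \le \epsilon
\]

with probability at least $1 - \delta$.

Thus, we can sample state-action pairs from any distribution $\psi$ on $K$ to obtain tractable relaxations of
problems (5.31) - (5.33) and (5.34) - (5.36) with probabilistic feasibility guarantees. Note that the number
of samples required is $O\left( \frac{1}{\epsilon} \ln \frac{1}{\epsilon}, \ln \frac{1}{\delta} \right)$.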
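The sample-size bound in Theorem 5.5 is simple to evaluate. The helper below is a small sketch (the function name is ours, not from the cited result) that returns the smallest integer $m$ satisfying $m \ge \frac{4}{\epsilon} \left( k \ln \frac{12}{\epsilon} + \ln \frac{2}{\delta} \right)$. Here `k` is the number of decision variables of the sampled LP and `delta` is the confidence parameter of the theorem, not the discount factor.

```python
import math

def constraint_sample_size(k, eps, delta):
    """Smallest integer m with m >= (4/eps) * (k * ln(12/eps) + ln(2/delta))."""
    if k < 1:
        raise ValueError("k must be a positive number of decision variables")
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("eps and delta must lie in (0, 1)")
    bound = (4.0 / eps) * (k * math.log(12.0 / eps) + math.log(2.0 / delta))
    return math.ceil(bound)

# Example: 10 decision variables, violation level 0.1, failure probability 0.05.
m = constraint_sample_size(k=10, eps=0.1, delta=0.05)
m_tighter = constraint_sample_size(k=10, eps=0.05, delta=0.05)
print(m, m_tighter)
```

The bound grows linearly in the number of decision variables and roughly like $\frac{1}{\epsilon} \ln \frac{1}{\epsilon}$ in the violation level, but it does not depend on the number of states or actions, which is what makes sampling attractive for large models.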
5.4 Finite state and action spaces

The development for finite state and action spaces is much simpler. Now both problems (3.3) - (3.7) and (4.1)
- (4.3) are usual linear programming problems with finitely many variables and constraints. The usual linear
programming duality theory applies immediately to establish strong duality between these two problems.

For this section, let $x$ denote an occupation measure on $K$ to emphasize that it is finite-dimensional.
Also suppose the benchmark $Y$ has finite support $\mathcal{Y} \triangleq \operatorname{supp} Y = \{ \eta_1, \ldots, \eta_q \} \subset \mathbb{R}$, so that constraint (2.5) is
equivalent to

\[
E_x[(z(s,a) - \eta)_-] \ge E[(Y - \eta)_-], \quad \forall \eta \in \operatorname{supp} Y, \tag{5.38}
\]

by [HI Proposition 3.2]. Each expectation

\[
E_x[(z(s,a) - \eta)_-] = \sum_{(s,a) \in K} x(s,a) \, (z(s,a) - \eta)_-
\]

is a linear function of $x$.

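For finite state and action spaces, the steady-state version of problem (2.3) - (2.5) is:

\begin{align}
\max \quad & \sum_{(s,a) \in K} r(s,a) \, x(s,a) \tag{5.39} \\
\text{s.t.} \quad & \sum_{a \in A(j)} x(j,a) - \sum_{s \in S} \sum_{a \in A(s)} P(j \mid s, a) \, x(s,a) = 0, \quad \forall j \in S, \tag{5.40} \\
& \sum_{(s,a) \in K} x(s,a) = 1, \tag{5.41} \\
& E_x[(z(s,a) - \eta)_-] \ge E[(Y - \eta)_-], \quad \forall \eta \in \operatorname{supp} Y, \tag{5.42} \\
& x \ge 0. \tag{5.43}
\end{align}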
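The constraints of this finite linear program can be checked directly for the occupation measure of a given stationary policy. The sketch below does not solve the LP; it assumes a hypothetical two-state MDP, a hand-picked randomized stationary policy `phi`, and an invented two-point benchmark, computes the stationary distribution by power iteration, and then tests the flow, normalization, and dominance constraints.

```python
# Feasibility check of a stationary policy in the finite steady-state LP.
# The MDP, the policy phi, and the benchmark Y are hypothetical.
S, A = [0, 1], [0, 1]
P = {(0, 0): [0.9, 0.1], (0, 1): [0.4, 0.6],
     (1, 0): [0.3, 0.7], (1, 1): [0.8, 0.2]}   # P[(s, a)][j]
r = {(0, 0): -1.0, (0, 1): -0.5, (1, 0): -2.0, (1, 1): -0.1}
z = {(0, 0): 0.3, (0, 1): -0.2, (1, 0): 0.8, (1, 1): 0.1}
phi = {0: [0.5, 0.5], 1: [0.2, 0.8]}           # phi[s][a] = probability of a in s

# Stationary distribution of the induced Markov chain by power iteration.
pi = [0.5, 0.5]
for _ in range(1000):
    pi = [sum(pi[s] * phi[s][a] * P[(s, a)][j] for s in S for a in A) for j in S]

# Occupation measure x(s, a) = pi(s) * phi(a | s).
x = {(s, a): pi[s] * phi[s][a] for s in S for a in A}

# Flow balance for every state j, and total mass one.
flow_residual = max(abs(sum(x[(j, a)] for a in A) - sum(P[k][j] * x[k] for k in x))
                    for j in S)
mass = sum(x.values())

# Dominance constraint at each support point of the benchmark Y.
Y = [(-0.2, 0.5), (0.1, 0.5)]                  # (value, probability)

def shortfall_x(eta):
    return sum(x[k] * min(z[k] - eta, 0.0) for k in x)

def shortfall_Y(eta):
    return sum(p * min(v - eta, 0.0) for v, p in Y)

dominance_ok = all(shortfall_x(eta) >= shortfall_Y(eta) - 1e-12 for eta, _ in Y)
average_reward = sum(r[k] * x[k] for k in x)
print(x, dominance_ok, average_reward)
```

Since every constraint is linear in $x$, a check of this kind only involves finite sums, and any off-the-shelf LP solver could be used to optimize over $x$ instead of fixing the policy in advance.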

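Duality for problem (3.3) - (3.7) is immediate from linear programming duality. As discussed in [34, Chapter
8], the dual of the linear programming problem without the dominance constraints is

\begin{align*}
\min \quad & g \\
\text{s.t.} \quad & g + h(s) - \sum_{j \in S} P(j \mid s, a) \, h(j) \ge r(s,a), \quad \forall (s,a) \in K, \\
& g \in \mathbb{R}, \quad h \in \mathbb{R}^{|S|}.
\end{align*}

The vector $h$ is interpreted as the average cost-to-go function. To proceed with the dual for problem (3.3) -
(3.7), let $\lambda \in \mathbb{R}^{|\mathcal{Y}|}$ with $\lambda \ge 0$ and consider the piecewise linear increasing concave function

\[
u(x) = \sum_{\eta \in \mathcal{Y}} \lambda(\eta) \, (x - \eta)_-
\]

with breakpoints at $\eta \in \mathcal{Y}$. The above function $u$ can be interpreted as a utility function for a risk-averse
decision maker. We define

\begin{align*}
\mathcal{U}(\mathcal{Y}) &= \operatorname{cl} \operatorname{cone} \{ (x - \eta)_- : \eta \in \mathcal{Y} \} \\
&= \left\{ u(x) = \sum_{\eta \in \mathcal{Y}} \lambda(\eta) \, (x - \eta)_- \ \text{for} \ \lambda \in \mathbb{R}^{|\mathcal{Y}|}, \ \lambda \ge 0 \right\}
\end{align*}

to be the set of all such functions. Since $\mathcal{Y}$ is assumed to be finite, $\mathcal{U}(\mathcal{Y})$ is a finite dimensional set.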
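Theorem 5.6. The dual to problem (5.39) - (5.43) is

\begin{align}
\min \quad & g - E[u(Y)] \tag{5.44} \\
\text{s.t.} \quad & r(s,a) + u(z(s,a)) \le g + h(s) - \sum_{j \in S} P(j \mid s, a) \, h(j), \quad \forall (s,a) \in K, \tag{5.45} \\
& g \in \mathbb{R}, \quad h \in \mathbb{R}^{|S|}, \quad u \in \mathcal{U}(\mathcal{Y}). \tag{5.46}
\end{align}

Strong duality holds between problem (5.39) - (5.43) and problem (5.44) - (5.46).

Proof. Introduce the Lagrangian

\begin{align*}
L(x, g, h, \lambda) = {} & \sum_{(s,a) \in K} r(s,a) \, x(s,a) + g \left( \sum_{(s,a) \in K} x(s,a) - 1 \right) \\
& + \sum_{j \in S} h(j) \left[ \sum_{a \in A(j)} x(j,a) - \sum_{s \in S} \sum_{a \in A(s)} P(j \mid s, a) \, x(s,a) \right] \\
& + \sum_{\eta \in \mathcal{Y}} \lambda(\eta) \left[ \sum_{(s,a) \in K} x(s,a) \, (z(s,a) - \eta)_- - E[(Y - \eta)_-] \right].
\end{align*}

Define the increasing concave function

\[
u(x) = \sum_{\eta \in \mathcal{Y}} \lambda(\eta) \, (x - \eta)_-,
\]

then

\[
\sum_{\eta \in \mathcal{Y}} \lambda(\eta) \left[ \sum_{(s,a) \in K} x(s,a) \, (z(s,a) - \eta)_- - E[(Y - \eta)_-] \right]
= \sum_{(s,a) \in K} x(s,a) \, u(z(s,a)) - E[u(Y)]
\]

by interchanging finite sums. So, the Lagrangian could also be written as

\begin{align*}
L(x, g, h, u) = {} & \sum_{(s,a) \in K} r(s,a) \, x(s,a) + g \left( \sum_{(s,a) \in K} x(s,a) - 1 \right) \\
& + \sum_{j \in S} h(j) \left[ \sum_{a \in A(j)} x(j,a) - \sum_{s \in S} \sum_{a \in A(s)} P(j \mid s, a) \, x(s,a) \right] \\
& + \sum_{(s,a) \in K} x(s,a) \, u(z(s,a)) - E[u(Y)],
\end{align*}

for $u \in \mathcal{U}(\mathcal{Y})$. The dual to problem (5.39) - (5.43) is defined as

\[
\min_{g \in \mathbb{R}, \ h \in \mathbb{R}^{|S|}, \ u \in \mathcal{U}(\mathcal{Y})} \left\{ \max_{x \ge 0} L(x, g, h, u) \right\}.
\]

Rearranging the Lagrangian gives

\[
L(x, g, h, u) = \sum_{(s,a) \in K} x(s,a) \left[ r(s,a) + g + h(s) - \sum_{j \in S} P(j \mid s, a) \, h(j) + u(z(s,a)) \right] - g - E[u(Y)],
\]

so that the dual to problem (5.39) - (5.43) is

\begin{align*}
\min \quad & -g - E[u(Y)] \\
\text{s.t.} \quad & r(s,a) + g + h(s) - \sum_{j \in S} P(j \mid s, a) \, h(j) + u(z(s,a)) \le 0, \quad \forall (s,a) \in K, \\
& g \in \mathbb{R}, \quad h \in \mathbb{R}^{|S|}, \quad u \in \mathcal{U}(\mathcal{Y}).
\end{align*}

Since $g$ and $h$ are unrestricted, replace $g$ by $-g$ and $h$ by $-h$ to get the desired result. □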

We used linear programming duality in the preceding proof for illustration. Alternatively, we could have
just applied our general strong duality result from earlier. It is immediate that problem (5.44) - (5.46) is the
finite-dimensional version of problem (4.7) - (4.9).

There is no difficulty with the Slater condition for problems (5.39) - (5.43) and (5.44) - (5.46) as there
is in [SI [TU]. In [SJ [TU], the decision variable in a stochastic program is a random variable so stochastic
dominance constraints are nonlinear. In our case, the decision variable $x$ is in the space of measures and the
dominance constraints are linear. Linear programming duality does not depend on the Slater condition.

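The development for the discounted case is similar. In terms of discounted occupation measures $x$ and an initial state distribution $\alpha$,
problem (5.16) - (5.18) is

\begin{align}
\max \quad & \sum_{(s,a) \in K} r(s,a) \, x(s,a) \tag{5.47} \\
\text{s.t.} \quad & \sum_{a \in A(j)} x(j,a) - \sum_{s \in S} \sum_{a \in A(s)} \delta P(j \mid s, a) \, x(s,a) = \alpha(j), \quad \forall j \in S, \tag{5.48} \\
& E_x[(z(s,a) - \eta)_-] \ge E[(Y - \eta)_-], \quad \forall \eta \in \mathcal{Y}, \tag{5.49} \\
& x \ge 0. \tag{5.50}
\end{align}

We compute the dual to problem (5.47) - (5.50) in the next theorem using the space of utility functions $\mathcal{U}(\mathcal{Y})$
from earlier.

Theorem 5.7. The dual to problem (5.47) - (5.50) is

\begin{align}
\min \quad & \sum_{j \in S} \alpha(j) \, v(j) - E[u(Y)] \tag{5.51} \\
\text{s.t.} \quad & v(s) - \sum_{j \in S} \delta P(j \mid s, a) \, v(j) \ge r(s,a) + u(z(s,a)), \quad \forall (s,a) \in K, \tag{5.52} \\
& v \in \mathbb{R}^{|S|}, \quad u \in \mathcal{U}(\mathcal{Y}). \tag{5.53}
\end{align}

Strong duality holds between problem (5.47) - (5.50) and problem (5.51) - (5.53).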

6 Portfolio optimization 

We use an infinite horizon discounted portfolio optimization problem to illustrate our ideas in this section. 
A single period portfolio optimization with stochastic dominance constraints is analyzed in . Specifically, 
the model in puts a stochastic dominance constraint on the return rate of a portfolio allocation. We
use this model as our motivation for the dynamic setting and put a stochastic dominance constraint on the 
discounted infinite horizon return rate. 

Suppose there are $n$ assets whose prices evolve according to a discrete time Markov chain. We can include
a riskless asset with a constant return rate in this set. The asset prices at time $t$ are

\[
p_t = (p_t(1), \ldots, p_t(n)) \in \mathbb{R}^n,
\]

where $p_t(i)$ is the price per share of asset $i$ at time $t$. The portfolio at time $t$ is captured by

\[
x_t = (x_t(1), \ldots, x_t(n)) \in \mathbb{R}^n,
\]

where $x_t(i)$ is the quantity of shares held of asset $i$ at time $t$. For a cleaner model, we just treat each $x_t(i)$
as a continuous decision variable. We require $\sum_{i=1}^{n} x_t(i) = 1$ and $x_t \ge 0$ for all $t \ge 0$; there is no shorting.
The total wealth at time $t$ is then $\langle p_t, x_t \rangle$.

At each time $t \ge 0$, the investor observes the current prices of the assets and then updates portfolio
positions subject to transaction costs before new prices are realized. Let $a_t \in \mathbb{R}^n$ be the buying and selling
decisions at time $t$, where $a_t(i)$ is the total change in the number of shares held of asset $i$. Define

\[
A(p, x) = \left\{ a \in \mathbb{R}^n : x(i) + a(i) \ge 0 \ \text{for all} \ i = 1, \ldots, n, \ \sum_{i=1}^{n} p(i) \, a(i) = 0 \right\}
\]

to be the set of feasible reallocations given prices $p$ and holdings $x$. The constraint $\sum_{i=1}^{n} p(i) \, a(i) = 0$ requires
the total change in wealth from buying and selling decisions to be zero in any period. The system dynamic
for portfolio positions is then

\[
x_{t+1}(i) = x_t(i) + a_t(i), \quad i = 1, \ldots, n, \ t \ge 0. \tag{6.1}
\]

The transaction costs $c : A \to \mathbb{R}$ are defined to be



:(o) = J2 at 



this cost function is a moment on S x A. 

The overall return rate between time $t$ and $t + 1$ is

\[
z(p_t, x_t; p_{t+1}, x_{t+1}) \triangleq \frac{\langle p_{t+1}, x_{t+1} \rangle - \langle p_t, x_t \rangle}{\langle p_t, x_t \rangle}.
\]

We make the reasonable assumption that $z(p_t, x_t; p_{t+1}, x_{t+1})$ is bounded for this example.
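The reallocation set and the return rate are straightforward to compute for given prices and holdings. The sketch below uses invented prices and positions for two assets; it checks only membership in $A(p, x)$ (no shorting after the trade, zero net trade value) and evaluates $z$ after a price move, and it does not model transaction costs or any other constraint from the text. The helper names are ours.

```python
# Feasible reallocations and one-period return rate for a two-asset example.
# Prices, holdings, and trades are hypothetical.
def wealth(p, x):
    """Total wealth <p, x>."""
    return sum(pi * xi for pi, xi in zip(p, x))

def is_feasible(p, x, a, tol=1e-9):
    """Membership test for A(p, x): no shorting and zero net trade value."""
    no_short = all(xi + ai >= -tol for xi, ai in zip(x, a))
    self_financing = abs(sum(pi * ai for pi, ai in zip(p, a))) <= tol
    return no_short and self_financing

def step(x, a):
    """Position update x_{t+1}(i) = x_t(i) + a_t(i)."""
    return [xi + ai for xi, ai in zip(x, a)]

def return_rate(p_t, x_t, p_next, x_next):
    """z = (<p_{t+1}, x_{t+1}> - <p_t, x_t>) / <p_t, x_t>."""
    w = wealth(p_t, x_t)
    return (wealth(p_next, x_next) - w) / w

p_t = [10.0, 20.0]
x_t = [2.0, 1.0]                 # wealth 10*2 + 20*1 = 40
a_t = [1.0, -0.5]                # buy one share of asset 1, sell half a share of asset 2
x_next = step(x_t, a_t)          # [3.0, 0.5]
p_next = [11.0, 18.0]            # new prices realized after the trade

feasible = is_feasible(p_t, x_t, a_t)
infeasible = is_feasible(p_t, x_t, [1.0, -2.0])   # would short asset 2
rate = return_rate(p_t, x_t, p_next, x_next)      # (42 - 40) / 40
print(feasible, infeasible, rate)
```

Because trades are self-financing at current prices, the return rate is driven entirely by the price change applied to the post-trade holdings.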
We want to minimize discounted transaction costs

\[
C(\pi, \nu) \triangleq E_\nu^\pi \left[ \sum_{t \ge 0} \delta^t c(a_t) \right]
\]

subject to a stochastic dominance constraint on the discounted return rate. Define

\[
Z_\eta(\pi, \nu) \triangleq E_\nu^\pi \left[ \sum_{t \ge 0} \delta^t (z(p_t, x_t; p_{t+1}, x_{t+1}) - \eta)_- \right]
\]

to be the expected discounted shortfall in relative returns at level $\eta$. We introduce a benchmark $Y$ for the
discounted return rate, and we suppose the support of $Y$ is bounded within $[a,b]$. In this example, the
benchmark can be taken as any market index.
We absorb the system dynamic (6.1) into a transition kernel $Q$. Our resulting portfolio optimization
problem is then

\begin{align}
\max \quad & -C(\pi, \nu) \tag{6.2} \\
\text{s.t.} \quad & Z_\eta(\pi, \nu) \ge E[(Y - \eta)_-], \quad \forall \eta \in [a,b]. \tag{6.3}
\end{align}

In the linear programming formulation of (6.2) - (6.3), we simply augment the state space and consider
occupation measures over sequences 



to correctly compute z. 

7 Conclusion 

We have shown how to use stochastic dominance constraints in infinite horizon MDPs. Convex analytic 
methods establish that stochastic dominance constrained MDPs can be solved via linear programming, and 
have corresponding dual linear programming problems. Conditions are given for strong duality to hold 
between these two linear programs. Utility functions appear in the dual as pricing variables corresponding 
to the stochastic dominance constraints. This result has intuitive appeal, since our stochastic dominance 
constraints are defined in terms of utility functions, and parallels earlier results [S|[TD|[T5]. Our results are 
shown to be extendable to many types of stochastic dominance constraints, particularly multivariate ones. 

There are three main directions for our future work. First, we will consider efficient strategies for 
computing the optimal policy for stochastic dominance constrained MDPs. Second, we would like to explore
other methods for modeling risk in MDPs using convex analytic methods. Specifically, we are interested 
in solving MDPs with convex risk measures and chance constraints with "static" optimization problems 
as we have done here. Third, as suggested by the portfolio example, we will consider online data-driven 
optimization for the stochastic dominance-constrained MDPs in this paper. The transition probabilities of 
underlying MDPs are not known in practice and must be learned online. 